ICP-TROP
Molecular Evolution of Proteins and Phylogenetic Analysis
Fred R. Opperdoes
Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Université catholique de Louvain, Brussels, Belgium

Contents (1)
Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA
  Codon bias
  The long time horizon
  Introns
  Multigene families
  Protein is the unit of selection
  RNA editing

Contents (2)
Methods for the multiple alignment of protein sequences
  Two sequences
  Multiple sequences (automatic)
  Manual alignment
Methods for the inference of protein phylogeny
  Distance methods
  Maximum parsimony
  Reliability and rooting of trees

What is a phylogenetic tree and what does it tell you?
[Figure: rooted tree with nodes labelled A to I; the root, the external nodes (OTUs) and the internal nodes are marked]
A-E are external nodes (extant); F-I are internal (ancestral) nodes.
OTUs are operational taxonomic units. They can be species, populations, individuals, genes or proteins. They are the extant (existing) units.
Internal nodes represent ancestral units.
Topology: the branching order of the nodes on the tree.

The 'tree of life' based on rRNA sequences
[Figure: rRNA tree with the groups Eubacteria, Archaebacteria and Eukaryota. The eukaryotic branches are labelled Diplomonads, Microsporidia, Parabasalia, Kinetoplastida, Euglena, Ciliates, Algae, Plants, Fungi and Animals, and are divided into amitochondriates and mitochondriates.]

The fusion hypothesis: the eukaryotic cell is a chimaera of eubacterial and archaebacterial traits
[Figure: rRNA tree of Eubacteria, Archaebacteria and Eukaryota (Diplomonads, Microsporidia, Parabasalia, Kinetoplastida, Euglena, Ciliates, Algae, Plants, Fungi, Animals), annotated with "Common ancestor?", "Energy metabolism", "Genetic machinery" and "Root?"]

Triosephosphate isomerase
[Figure: phylogenetic tree of triosephosphate isomerase sequences, scale bar 0.1. The leaves carry database entry names (TPIS HUMAN, TPIS YEAST, TPIS TRYBB, TPIS ECOLI, TPIS METJA and so on) and are grouped as Animalia, Planta, Protists, Fungi, Eubacteria and Archaebacteria; the position of the root is marked "Root?".]
Triosephosphate isomerase of eukaryotes is of typical eubacterial origin and probably entered the eukaryotic cell together with the bacterial endosymbiont that gave rise to the mitochondrion.

What is required
  A DNA or protein sequence
  A set of homologous sequences
  A good multiple sequence alignment
  Several programs to create a phylogenetic tree

DNA or protein?
>TBTIM T.brucei TIM gene for microbody triosephosphate isomerase.
CTGCAGCAACTTACTGGGGACGCTGCTATCCTTTCTTCTTCATATTTCTCGTTTACCTAC
GTTTAGAGTCTCTGAGATCATTACTAGCAAGCAAACAAGAAGCCATTTGAGTTTCAAGCA
AAGTCTACCAAAAAACAAACTCTTATTATACCGTGCCAAATTATGTCCAAGCCACAACCC
ATCGCAGCAGCCAACTGGAAGTGCAACGGCTCCCAACAGTCTTTGTCGGAGCTTATTGAT
CTGTTTAACTCCACAAGCATCAACCACGACGTGCAATGCGTAGTGGCCTCCACCTTTGTT
CACCTTGCCATGACGAAGGAGCGTCTTTCACACCCCAAATTTGTGATTGCGGCGCAGAAC
GCCATTGCAAAGAGCGGTGCCTTCACCGGCGAAGTCTCCCTGCCCATCCTCAAAGATTTC
GGTGTCAACTGGATTGTTCTGGGTCACTCCGAGCGCCGCGCATACTATGGTGAGACAAAC
GAGATTGTTGCGGACAAGGTTGCCGCCGCCGTTGCTTCTGGTTTCATGGTTATTGCTTGC
ATCGGCGAAACGCTGCAGGAGCGTGAATCAGGTCGCACCGCTGTTGTTGTGCTCACACAG
ATCGCTGCTATTGCTAAGAAACTGAAGAAGGCTGACTGGGCCAAAGTTGTCATCGCCTAC
GAACCCGTTTGGGCCATTGGTACCGGCAAGGTGGCGACACCACAGCAAGCGCAGGAAGCC
CACGCACTCATCCGCAGCTGGGTGAGCAGCAAGATTGGAGCAGATGTCGCGGGAGAGCTC
CGCATTCTTTACGGCGGTTCTGTTAATGGAAAGAATGCGCGCACTCTTTACCAACAGCGA
GACGTCAACGGCTTCCTTGTTGGTGGTGCCTCACTTAAGCCAGAATTTGTGGACATCATC
AAAGCCACTCAGTGATTTTCCTTCATGTGTCAATGAGGTTTGGTGCTTTTGCCGTTGAGT
GGGTGAAGATAGCGGTATATATATATATATATATATATATATATGCGCAAGTGAATATAA
AAAAGATGTAAAGACAGGTAGCAGGGAGAAAACCTCGCATAACATTATAAAAGGGAGTGT
AACTGGAGTGGGAAAACAAAGGAAAGGGGGATTCGTGTATTGAGCATATGAGAAAAAAAA
AAGAAATTATGTTGTATGTTTTTACCTATAATTTATGCGAAGTGAATGACAAAACAAAAA
CCAAAAGGATATCATCATATGCTTTGTTTCATCCAAATGGTTGTTTCTTCCGTACCTCAG
GGTCACTACTTCGTTGAGTGTGGTTTTAGCGAGGAGAGGGAACAATAGGGGGTGTTGTAT
ACATTTACACGTACGTATCTTCCTTTACTCTCTCTTGCCTTCATTATATTCCCCCTTTTT
CTGGGAGAGGAAAAGAGAGTGTAGAATGAGGGGAGTACGTGTACGGAATTTTAACGATTA
CCCCCTTTTTTTTCTTTGAACTATTATTTTTAGAATTC
>P04789|TPIS_TRYBB Triosephosphate isomerase, glycosomal (TIM) (Triose-phosphate isomerase)
MSKPQPIAAANWKCNGSQQSLSELIDLFNSTSINHDVQCVVASTFVHLAMTKERLSHPKF
VIAAQNAIAKSGAFTGEVSLPILKDFGVNWIVLGHSERRAYYGETNEIVADKVAAAVASG
FMVIACIGETLQERESGRTAVVVLTQIAAIAKKLKKADWAKVVIAYEPVWAIGTGKVATP
QQAQEAHALIRSWVSSKIGADVAGELRILYGGSVNGKNARTLYQQRDVNGFLVGGASLKP
EFVDIIKATQ

The universal genetic code
First      Second position           Third
position   U(T)   C     A     G      position
U(T)       Phe    Ser   Tyr   Cys    U(T)
           Phe    Ser   Tyr   Cys    C
           Leu    Ser   STOP  STOP   A
           Leu    Ser   STOP  Trp    G
C          Leu    Pro   His   Arg    U(T)
           Leu    Pro   His   Arg    C
           Leu    Pro   Gln   Arg    A
           Leu    Pro   Gln   Arg    G
A          Ile    Thr   Asn   Ser    U(T)
           Ile    Thr   Asn   Ser    C
           Ile    Thr   Lys   Arg    A
           Met    Thr   Lys   Arg    G
G          Val    Ala   Asp   Gly    U(T)
           Val    Ala   Asp   Gly    C
           Val    Ala   Glu   Gly    A
           Val    Ala   Glu   Gly    G
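
As an illustration of how the code table is applied, the following minimal Python sketch translates a coding sequence in its first reading frame. The names `CODON_TABLE` and `translate` are hypothetical helpers introduced here, not part of the original slides, and the short test sequence is simply the start of the open reading frame in the TBTIM entry shown earlier.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the standard code in TCAG x TCAG x TCAG order;
# '*' marks the three STOP codons (TAA, TAG, TGA).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, table=CODON_TABLE):
    """Translate a coding sequence in frame 1, stopping at the first STOP."""
    dna = dna.upper().replace("U", "T")  # accept RNA as well as DNA
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = table[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# First ten codons of the T. brucei TIM open reading frame.
print(translate("ATGTCCAAGCCACAACCCATCGCAGCAGCC"))
```

Passing a different `table` (for example one with TGA mapped to W) is one way to handle organisms or organelles that deviate from the universal code.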

Arguments in favour of protein rather than DNA sequences
CODON BIAS:
64 different possible triplet codes encode 20 amino acids. One amino acid may be encoded by 1 to 6 different triplet codes, and 3 of the 64 codes, called stop (or termination) codons, specify "end of peptide sequence".
The different codons are used with unequal frequency, and this distribution of frequency is referred to as "codon usage".
Codon usage varies between species. The code is degenerate, with most of the redundancy ("wobble") in the third codon position.

Arguments in favour of ... (codon bias 2)
Yeasts, protozoa and animals have different codon preferences.
This results in differences in DNA sequence that are related to codon bias and not to the evolution of the protein.

Different species use different codons

Homo sapiens [gbmam]: 1 CDS's (389 codons)
fields: [triplet] [frequency: per thousand] ([number])
UUU 20.6(     8)  UCU  5.1(     2)  UAU  7.7(     3)  UGU  7.7(     3)
UUC 12.9(     5)  UCC 20.6(     8)  UAC 30.8(    12)  UGC  0.0(     0)
UUA 10.3(     4)  UCA 18.0(     7)  UAA  0.0(     0)  UGA  0.0(     0)
UUG 10.3(     4)  UCG  0.0(     0)  UAG  2.6(     1)  UGG 15.4(     6)

Saccharomyces cerevisiae [gbpln]: 9295 CDS's (4586264 codons)
fields: [triplet] [frequency: per thousand] ([number])
UUU 25.9(118900)  UCU 23.6(108308)  UAU 18.7( 85651)  UGU  8.0( 36624)
UUC 18.3( 83880)  UCC 14.3( 65421)  UAC 14.7( 67599)  UGC  4.6( 21255)
UUA 26.3(120698)  UCA 18.7( 85618)  UAA  1.0(  4476)  UGA  0.6(  2742)
UUG 27.2(124967)  UCG  8.5( 39137)  UAG  0.4(  2058)  UGG 10.4( 47694)
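
The "per thousand" figures in such tables are simply codon counts scaled by the total number of codons (for the human entry, 8 UUU codons out of 389 gives 20.6). A minimal sketch of that calculation follows; `codon_usage` is a hypothetical helper written for this illustration and assumes an in-frame coding sequence without introns.

```python
from collections import Counter


def codon_usage(cds):
    """Return {codon: frequency per thousand codons} for an in-frame CDS."""
    cds = cds.upper().replace("T", "U")          # report codons as RNA triplets
    usable = len(cds) - len(cds) % 3             # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


usage = codon_usage("ATGTTTTTTTTC")   # codons: AUG UUU UUU UUC
for codon in sorted(usage):
    print(codon, round(usage[codon], 1))
```

Comparing two such dictionaries, one per species, makes the difference in codon preference explicit.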

Differences between the "Universal" and Mitochondrial Genetic Codes
Codon   Universal code   Mitochondrial code
UGA     Stop             Trp
AGA     Arg              Stop
AGG     Arg              Stop
AUA     Ile              Met
Modified from: Li and Graur, 1991, Fundamentals of Molecular Evolution, Sinauer Publ.

Arguments in favour... (codon bias)
Also, some protozoa (the ciliates) use the codons TAA and TAG to encode glutamine, rather than STOP.
In mitochondria the codon TGA encodes tryptophan, rather than STOP.
The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is.

Arguments in favour... (codon bias 2)
High GC content of DNA seems to be associated with aerobiosis in prokaryotes (Naya et al., 2002).
In all major groups, organisms with AT-rich DNA and organisms with GC-rich DNA can both be found.
The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is.

GC content of DNA in aerobic and anaerobic prokaryotes
[Figure: GC content distributions, panels "Anaerobic" and "Aerobic"]
From Naya et al., J. Mol. Evol. 55 (2002) 260-264

The use of protein sequences in phylogeny requires knowledge of the properties of the amino acids and their single letter codes

Alanine         A     Leucine         L
Arginine        R     Lysine          K
Asparagine      N     Methionine      M
Aspartic acid   D     Phenylalanine   F
Cysteine        C     Proline         P
Glutamic acid   E     Serine          S
Glutamine       Q     Threonine       T
Glycine         G     Tryptophan      W
Histidine       H     Tyrosine        Y
Isoleucine      I     Valine          V

Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA
LONG TIME HORIZON:
When comparing sequences that have diverged for possibly a billion years or more, it is very likely that the wobble bases in the codons will have become randomized. By excluding the wobble bases (a common technique), one is in effect looking at amino acid sequences.
So why not use the protein sequence directly?

Advantages of the translation of DNA into protein (1)
DNA is composed of only four kinds of unit: A, G, C and T.
If gaps are not allowed, on average 25% of the residues in two randomly chosen aligned sequences will be identical.
If gaps are allowed, as much as 50% of the residues in two randomly chosen aligned sequences can be identical. Such a background may obscure any genuine relationship that exists, especially when comparing distantly related or rapidly evolving gene sequences.
Moreover, it is easier to translate a gene sequence into its corresponding protein than to remove the third (wobble) base from each of the codons in the gene.

Alignment of two random DNA sequences
[Figure: two example alignments of the same pair of random DNA sequences]
Without indels: 19% identity
Indels allowed: 56% identity
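
The expected background identity without gaps can be checked with a small simulation. This is a sketch under simplifying assumptions that are mine, not the slides': residues are drawn independently and with equal frequencies, and the function names are hypothetical. With four letters the mean identity comes out near 25%; with twenty equally frequent letters it is near 5% (real proteins, whose amino acid composition is uneven, score slightly higher).

```python
import random


def ungapped_identity(a, b):
    """Fraction of identical positions when a and b are aligned without gaps."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n


def mean_random_identity(alphabet, length=1000, trials=50, seed=0):
    """Average ungapped identity between pairs of random sequences."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        total += ungapped_identity(a, b)
    return total / trials


print("DNA:    ", mean_random_identity("ACGT"))
print("protein:", mean_random_identity("ACDEFGHIKLMNPQRSTVWY"))
```

Allowing indels raises both figures, because gaps let the alignment slide into locally better registers; simulating that would need a gapped alignment algorithm, which is outside this sketch.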

Advantages of the translation of DNA into protein (2)
Translation of DNA into 21 different types of codon (20 amino acids and a terminator) sharpens the information considerably. Wrong-frame information is set aside.
Third-base degeneracies are consolidated.
After insertion of gaps to align two random protein sequences, they can be expected to be 10-20% identical.
As a result of translation, the protein sequences with their 20 amino acids are much easier to align than the corresponding DNA sequences with only 4 nucleotides.

Alignment of two random protein sequences
[Figure: two example alignments of the same pair of random protein sequences]
Without indels: 7% identity
Indels allowed: 22% identity

Advantages of the translation of DNA into protein (3)
If, after this, you still want to align distantly related gene sequences, you had better first prepare a protein alignment and then use it as a guide for the alignment of the gene sequences and the precise placement of indels.
Conclusion: the signal-to-noise ratio is greatly improved when using protein sequences rather than DNA sequences!
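
One way to base a gene alignment on a protein alignment is to "thread" each unaligned coding sequence onto its aligned protein row, writing three gap characters wherever the protein row has a gap. The sketch below illustrates the idea; `thread_codons` is a hypothetical helper, and it assumes the coding sequence starts in frame and encodes exactly the residues shown in the protein row.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Place the codons of an unaligned CDS onto an aligned protein row."""
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for this protein row")
    out = []
    pos = 0                                  # position of the next unused codon
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)              # a protein gap becomes a codon-sized gap
        else:
            out.append(cds[pos:pos + 3])
            pos += 3
    return "".join(out)


# Protein row "MS-KP" with its coding sequence ATG TCC AAG CCA
print(thread_codons("MS-KP", "ATGTCCAAGCCA"))
```

Because gaps are inserted only in multiples of three, the reading frame of every gene sequence is preserved in the resulting DNA alignment.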

TBLASTN
The BLAST program TBLASTN allows the use of translated protein sequence information to search for distant relationships between genes.
A protein sequence is compared with all the translated sequences from a nucleotide database.

Nature of Sequence Divergence in proteins
The observed sequence difference of two diverging sequences takes the course of a negative exponential. This is the result of the fact that each position is subject to reverse changes ("back mutations") and multiple hits.
Thus the observed percentage of difference between the protein sequences is not proportional to the actual evolutionary difference between two homologous sequences.
The evolutionary distance between two proteins is expressed in PAM units. PAM (Dayhoff and Eck, 1968) stands for "accepted point mutation".

Relation between % distance and PAM distance
PAM value   Distance (%)
 80         50
100         60
200         75
250         85   (twilight zone)
300         92
(From Doolittle, 1987, Of URFs and ORFs, University Science Books)
As the evolutionary distance increases, the probability of superimposed mutations becomes greater, so the observed percent difference increasingly underestimates the number of changes that actually occurred.

Relation between % distance and PAM distance
[Figure: observed distance (%) plotted against PAM value (0 to 400), with the twilight zone marked]

The Kimura correction for multiple substitutions
The formula used to correct for multiple hits is from Motoo Kimura (Kimura, M., The Neutral Theory of Molecular Evolution, Cambridge University Press, 1983, page 75):
K = -ln(1 - D - (D.D)/5), where D is the observed distance and K is the corrected distance.
This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1, i.e. more than one substitution per site on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been about 2.6 substitutions per site over the course of evolution since the 2 sequences diverged.
This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
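
The correction is easy to evaluate numerically. The short sketch below implements the formula exactly as written on the slide; the function names are hypothetical. Note that the logarithm is undefined once 1 - D - D²/5 reaches zero, which happens at an observed difference of roughly 0.85, so the sketch raises an error for such saturated pairs.

```python
import math


def kimura_distance(d):
    """Corrected distance K = -ln(1 - D - D*D/5) for an observed distance D."""
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        raise ValueError("observed distance too large for the Kimura correction")
    return -math.log(arg)


def kimura_pam(d):
    """The same distance expressed in PAM units (substitutions per 100 residues)."""
    return 100.0 * kimura_distance(d)


for d in (0.1, 0.5, 0.8):
    print(d, round(kimura_distance(d), 3), round(kimura_pam(d), 1))
```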

Proteins evolve at highly different rates
                                 Rate of change       Theoretical lookback time
                                 (PAMs / 10^8 yrs)    (10^6 yrs)
Pseudogenes                      400                     45
Fibrinopeptides                   90                    200
Lactalbumins                      27                    670
Lysozymes                         24                    850
Ribonucleases                     21                    850
Haemoglobins                      12                   1500
Acid proteases                     8                   2300
Cytochrome c                       4                   5000
Glyceraldehyde-P dehydrogenase     2                   9000
Glutamate dehydrogenase            1                  18000
PAM = number of Accepted Point Mutations per 100 amino acids. Useful lookback time = 360 PAMs
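
Most of the tabulated lookback times are close to 360 / (2 × rate) × 10^8 years. Reading the factor 2 as "both diverging lineages accumulate changes" is an interpretation of the numbers, not something stated on the slide, and a few rows (lysozymes, cytochrome c) deviate from it. Under that assumption the calculation can be sketched as follows, with `lookback_years` being a hypothetical helper name.

```python
USEFUL_PAMS = 360.0   # "useful lookback time" limit quoted on the slide


def lookback_years(rate_pams_per_1e8_years, useful_pams=USEFUL_PAMS):
    """Years until two lineages are separated by `useful_pams` PAMs.

    Assumes (my reading of the table) that each lineage changes at the given
    rate, so the pairwise distance grows at twice that rate.
    """
    return useful_pams / (2.0 * rate_pams_per_1e8_years) * 1e8


for name, rate in [("Pseudogenes", 400), ("Fibrinopeptides", 90),
                   ("Haemoglobins", 12), ("Glutamate dehydrogenase", 1)]:
    print(f"{name:25s} {lookback_years(rate) / 1e6:8.0f} x 10^6 yrs")
```

Slowly evolving proteins therefore remain informative over much longer time spans than fast-evolving ones.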

Some Important Dates in History
Event                                  Number of years ago (10^9 yrs)
Origin of the Universe                 15 ± 4
Formation of the Solar System          4.6
First Self-replicating System          3.5 ± 0.5
Prokaryotic-Eukaryotic Divergence      2.0 ± 0.5
Plant-Animal Divergence                ~1.0
Invertebrate-Vertebrate Divergence     0.5
Mammalian Radiation                    Beginning ~0.1
From Doolittle, Of URFs and ORFs, 1987

Construction of a phylogenetic tree from phosphoglycerate kinase sequences
[Figure: phylogenetic tree, scale bar 0.1, with leaves Rat, Mouse, Human, Horse, Drosophila, Schistosoma, Kluyveromyces, Yeast, Neurospora, Yarli, Plasmodium, Leishmania, Crithidia, Trypanosoma 1, Trypanosoma 2, Wheat, Zymomonas, Escherichia, Methanobacter and Bacillus]
Alignment block (names paired with rows in the order in which both are listed):
Human          GLDCGPESSKKYAEAVTRAKQIVWNGP
Mouse          GLDCGTESSKKYAEAVGRAKQIVWNGP
Horse          GLDCGTESSKKYAEAVARAKQIVWNGP
Drosophila     GLDVGPKTRELFAAPIARAKLIVWNGP
Schistosoma    GLDIGPKTIEEFSKVISRAKTIVWNGP
Wheat          GLDIGPDSVKTFNDALDTTQTIIWNGP
Yarli          GLDCGPKSIEEFQKVIGESKTILWNGP
Yeast          GLDNGPESRKLFAATVAKAKTIVWNGP
Neurospora     GLDCGEESVKLFTQAINESQTILWNGP
Kluyveromyces  GLDNGPESRKAFAATVAEAKTIVWNGP
Plasmodium     GLDAGPKSIENYKDVILTSKTVIWNGP
Trypanosoma    ALDIGPKTIEKYVQTIGKCKSAIWNGP
Crithidia      ALDIGPKTIKIYEDVIAKCKSTIWNGP
Leishmania     ALDIGPRTIHMYEEVIGRCKSAIWNGP
Bacillus       ALDIGPKTRELYRDVIRESKLVVWNGP
Escherichia    ILDIGDASAQELAEILKNAKTILWNGP
Mycobacter     SLDVGSKTIALFESYLKTAKTIFWNGP
Zymomonas      ILDVGPKAVAALTEVLKASKTLVWNGP
Methanobacter  IYDIGTNTITEYAKFIRDAKTIFANGP
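
The first step of a distance method is to turn aligned rows like these into pairwise distances. The sketch below computes the observed fraction of differing positions (the D that is later corrected for multiple substitutions) for the first two rows of the alignment block. Assigning those rows to Human and Mouse follows the order of the name list and is an assumption; `p_distance` is a hypothetical helper name.

```python
def p_distance(row_a, row_b, gap="-"):
    """Observed fraction of differing residues between two aligned rows.

    Columns in which either row has a gap are ignored.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(a != b for a, b in pairs) / len(pairs)


# First two rows of the alignment block (assumed to be Human and Mouse).
human = "GLDCGPESSKKYAEAVTRAKQIVWNGP"
mouse = "GLDCGTESSKKYAEAVGRAKQIVWNGP"
print(round(p_distance(human, mouse), 3))   # 2 differences in 27 columns
```

Applying the function to every pair of rows yields the distance matrix from which a tree can be built.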

Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA (3)
INTRONS:
A study of the evolution of a protein using its DNA sequence should only include coding sequences.
This requires that all introns be edited out of every DNA sequence, which may be cumbersome and time-consuming.
An easier approach is the direct translation of the cDNA sequence into its corresponding protein sequence.

Typical structure of a eukaryotic gene
[Figure: gene drawn 5' to 3' with flanking regions on both sides, Exon 1, Intron I, Exon 2, Intron II and Exon 3; the TATA box, transcription initiation site, initiation codon, stop codon, AATAAA signal and poly(A) addition site are marked]

Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA (4)
MULTIGENE FAMILIES:
Organisms may contain many highly similar genes, while only one peptide sequence can be identified (e.g. histones, tubulins and GAPDH in humans).
Using these DNA sequences, it would be difficult to decide which genes are expressed and which are not, and thus which genes to include in the analysis.
Moreover, if all the genes that are expressed encode the same protein, then the DNA differences are not significant.

Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA (5)
PROTEIN IS THE UNIT OF SELECTION:
For protein-encoding genes, the object on which natural selection acts is the protein itself.
The underlying DNA sequence reflects this process in combination with species-specific pressures on the DNA sequence (like the apparent need for aerobes to have DNA that is richer in GC).
If function demands that a protein maintains a specific sequence, there is still room for the DNA sequence to change.

Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA (6)
RNA EDITING:
The DNA sequence does not always translate directly into the amino acid sequence.
In post-transcriptional (RNA) editing, non-coded nucleotides are added to the transcript or coded nucleotides are removed from it.
This can lead to major differences in DNA sequence (sometimes more than 50%) that nevertheless lead to roughly the same protein sequence after final editing.

Pan-editing of mitochondrial RNA in Kinetoplastida
UCCuAuuA*AuUUUUUGuUA**UAuAGuuuuuuAA*UGUUGuuuGGuGuA*uuuuuuuAuUG*UGuuuAGuuuuGuuuuGuuGuuGuuuGuuuG****GUGuGuuAuuG**UUUUGAGAuuGuuG
Note that the mature mRNA would not be able to hybridise with the gene present in the kinetoplast DNA, so the gene cannot be detected as such.

Some good advice (1)
It is recommended to prepare the phylogenetic trees both ways (DNA and protein) and compare them.
For a group of sequences that are closely related and diverged relatively recently (like viral proteins or vertebrate enzymes), DNA-based analysis is probably a good way to go, since problems of codon bias and randomization of wobble bases are then avoided. But check the protein anyway.
ICP-TROP
Some good advice (2)
Be aware of the problems of multigene families (for instance coding for isoenzymes).
Be careful when you decide to exclude or include such sequences (you may compare paralogous rather than orthologous sequences).

Text available from: [email protected]
Text and slides: http://www.icp.be/~opperd/chapter8/
Website: http://www.icp.be/~opperd/private/proteins.html

For the creation of a phylogenetic tree a good alignment of protein sequences is of vital importance.
Only homologous residues should be aligned with each other.
Doubtful regions should not be included in the alignment.
Aligned sequences should have similar lengths.
Alignment of two protein sequences (1)

Dot-Matrix plots
[Figure: two dot-matrix plots. One shows two homologous sequences with 81% identity, the other two homologous sequences with 50% identity.]

Pair-wise alignment of two protein sequences according to the 'Dot-Matrix' method
[Figure: four dot-matrix panels (A, B, C, D). Each panel places a dot wherever the residue of the horizontal sequence equals the residue of the vertical sequence. The sequence pairs, in the order they appear, are:]
C D E G L D P G S E R K (horizontal) against CDEGLDPGSERK (vertical)
C D E P L D P G S Q R K (horizontal) against CDEGLDPGSERK (vertical)
C D E L D P G S Q R K (horizontal) against CDEGLDPGSERK (vertical)
C D E D G L S Q L K (horizontal) against CDEGLDPLSERK (vertical)
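The dot-matrix comparison on this slide can be sketched in a few lines of Python. This is only an illustration, not the program used to draw the slide: the function names `dot_matrix` and `render` are made up for the example, and a dot is placed wherever two residues are identical, without any window or threshold filtering.

```python
def dot_matrix(seq_a, seq_b):
    """Return a grid of booleans: True where seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the grid as text, seq_b across the top and seq_a down the side."""
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, dot_matrix(seq_a, seq_b)):
        lines.append(residue + " " + " ".join("•" if hit else " " for hit in row))
    return "\n".join(lines)


# Sequences taken from the slide: the second one lacks the Gly at
# position 4 and carries Gln instead of Glu near the end.
print(render("CDEGLDPGSERK", "CDELDPGSQRK"))
```

Identical sequences give an unbroken main diagonal. A deletion shifts the diagonal sideways and a substitution leaves a hole in it; the scattered off-diagonal dots come from residues that simply occur more than once.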

Alignment requires the user to make assumptions regarding relative costs of substitution versus insertions and deletions (indels).
If substitution cost >> gap penalty: there will be many short gaps and no phylogenetic information.
In general: search for maximum identity and minimize the number of insertions and deletions.
Exclude regions that cannot be aligned unambiguously!
Visual alignment is possible using the "dot-matrix method".
Alignment of two protein sequences (2)
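How the balance between substitution cost and gap penalty shapes an alignment can be made concrete with a small global-alignment scorer. The sketch below uses the standard Needleman-Wunsch recurrence with a linear gap penalty; the slides do not name an algorithm, so this choice, the function name `nw_score` and the default scores (+1, -1, -2) are assumptions made for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b (Needleman-Wunsch, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Sequences from the dot-matrix slide: one deleted Gly, one Glu -> Gln change.
full, shorter = "CDEGLDPGSERK", "CDELDPGSQRK"
print(nw_score(full, shorter))         # gaps cost more than substitutions
print(nw_score(full, shorter, gap=0))  # gaps are free
```

With the default costs the best alignment uses a single gap and accepts the Glu/Gln mismatch. With free gaps the scorer avoids every mismatch by opening gaps instead, which is the "many short gaps" situation the slide warns about.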

Identity matrix as used in Clustal
C 10,
S 0, 10,
T 0, 0, 10,
P 0, 0, 0, 10,
A 0, 0, 0, 0, 10,
G 0, 0, 0, 0, 0, 10,
N 0, 0, 0, 0, 0, 0, 10,
D 0, 0, 0, 0, 0, 0, 0, 10,
E 0, 0, 0, 0, 0, 0, 0, 0, 10,
Q 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
H 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
R 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
K 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
M 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
I 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
L 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
V 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
F 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
Y 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
W 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
  C S T P A G N D E Q H R K M I L V F Y W

Distance matrix with mutation costs for amino acids
          A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala = A 0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser = S 1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly = G 1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu = L 2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys = K 2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val = V 1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr = T 1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro = P 1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu = E 1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp = D 1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn = N 2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile = I 2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln = Q 2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg = R 2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe = F 2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr = Y 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys = C 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2
His = H 2 2 2 1 2 2 2 1 2 1 1 2 1 1 2 1 2 0 2 2 2 1 2
Met = M 2 2 2 1 1 1 1 2 2 2 2 1 2 1 2 3 2 2 0 2 2 2 2
Trp = W 2 1 1 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 0 2 2 2
Glx = Z 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2
Asx = B 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 1 2
??? = X 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
The distance table is generated by calculating the minimum number of base mutations required to convert an amino acid in row i to an amino acid in column j. Note that Met->Tyr is the only change that requires all 3 codon positions to change.

Hydrophobicity matrix
          R K D E B Z S N Q G X T H A C M P V L I Y F W
Arg = R 10 10 9 9 8 8 6 6 6 5 5 5 5 5 4 3 3 3 3 3 2 1 0
Lys = K 10 10 9 9 8 8 6 6 6 5 5 5 5 5 4 3 3 3 3 3 2 1 0
Asp = D 9 9 10 10 8 8 7 6 6 6 5 5 5 5 5 4 4 4 3 3 3 2 1
Glu = E 9 9 10 10 8 8 7 6 6 6 5 5 5 5 5 4 4 4 3 3 3 2 1
Asx = B 8 8 8 8 10 10 8 8 8 8 7 7 7 7 6 6 6 5 5 5 4 4 3
Glx = Z 8 8 8 8 10 10 8 8 8 8 7 7 7 7 6 6 6 5 5 5 4 4 3
Ser = S 6 6 7 7 8 8 10 10 10 10 9 9 9 9 8 8 7 7 7 7 6 6 4
Asn = N 6 6 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 7 7 7 6 6 4
Gln = Q 6 6 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 7 7 7 6 6 4
Gly = G 5 5 6 6 8 8 10 10 10 10 9 9 9 9 8 8 8 8 7 7 6 6 5
??? = X 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 8 8 8 8 7 7 5
Thr = T 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 8 8 8 8 7 7 5
His = H 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 9 8 8 8 7 7 5
Ala = A 5 5 5 5 7 7 9 9 9 9 10 10 10 10 9 9 9 8 8 8 7 7 5
Cys = C 4 4 5 5 6 6 8 8 8 8 9 9 9 9 10 10 9 9 9 9 8 8 5
Met = M 3 3 4 4 6 6 8 8 8 8 9 9 9 9 10 10 10 10 9 9 8 8 7
Pro = P 3 3 4 4 6 6 7 8 8 8 8 8 9 9 9 10 10 10 9 9 9 8 7
Val = V 3 3 4 4 5 5 7 7 7 8 8 8 8 8 9 10 10 10 10 10 9 8 7
Leu = L 3 3 3 3 5 5 7 7 7 7 8 8 8 8 9 9 9 10 10 10 9 9 8
Ile = I 3 3 3 3 5 5 7 7 7 7 8 8 8 8 9 9 9 10 10 10 9 9 8
Tyr = Y 2 2 3 3 4 4 6 6 6 6 7 7 7 7 8 8 9 9 9 9 10 10 8
Phe = F 1 1 2 2 4 4 6 6 6 6 7 7 7 7 8 8 8 8 9 9 10 10 9
Trp = W 0 0 1 1 3 3 4 4 4 5 5 5 5 5 6 7 7 7 8 8 8 9 10
Hydrophobicity scoring matrix constructed from hydrophilicity data (M. Levitt, J. Mol. Biol. 104, 59 [1976]), derived by George et al. 1990, Mutation Data Matrix and Its Uses, Methods in Enzymology 183, 333.

1 PAM evolutionary distance
      Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val
      A R N D C Q E G H I L K M F P S T W Y V
Ala A 9867 2 9 10 3 8 17 21 2 6 4 2 6 2 22 35 32 0 2 18
Arg R 1 9913 1 0 1 10 0 0 10 3 1 19 4 1 4 6 1 8 0 1
Asn N 4 1 9822 36 0 4 6 6 21 3 1 13 0 1 2 20 9 1 4 1
Asp D 6 0 42 9859 0 6 53 6 4 1 0 3 0 0 1 5 3 0 0 1
Cys C 1 1 0 0 9973 0 0 0 1 1 0 0 0 0 1 5 1 0 3 2
Gln Q 3 9 4 5 0 9876 27 1 23 1 3 6 4 0 6 2 2 0 0 1
Glu E 10 0 7 56 0 35 9865 4 2 3 1 4 1 0 3 4 2 0 1 2
Gly G 21 1 12 11 1 3 7 9935 1 0 1 2 1 1 3 21 3 0 0 5
His H 1 8 18 3 1 20 1 0 9912 0 1 1 0 2 3 1 1 1 4 1
Ile I 2 2 3 1 2 1 2 0 0 9872 9 2 12 7 0 1 7 0 1 33
Leu L 3 1 3 0 0 6 1 1 4 22 9947 2 45 13 3 1 3 4 2 15
Lys K 2 37 25 6 0 12 7 2 2 4 1 9926 20 0 3 8 11 0 1 1
Met M 1 1 0 0 0 2 0 0 0 5 8 4 9874 1 0 1 2 0 0 4
Phe F 1 1 1 0 0 0 0 1 2 8 6 0 4 9946 0 2 1 3 28 0
Pro P 13 5 2 1 1 8 3 2 5 1 2 2 1 1 9926 12 4 0 0 2
Ser S 28 11 34 7 11 4 6 16 2 2 1 7 4 3 17 9840 38 5 2 2
Thr T 22 2 13 4 1 3 2 2 1 11 2 8 6 1 5 32 9871 0 2 9
Trp W 0 2 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 9976 1 0
Tyr Y 1 0 3 0 3 0 1 0 4 1 1 0 0 21 0 1 1 2 9945 1
Val V 13 2 1 1 3 2 2 3 3 57 11 1 17 1 3 2 10 0 2 9901
[top row shows original amino acid; left column shows replacement amino acid]
Mutation probability matrix for the evolutionary distance of 1 PAM (i.e., one Accepted Point Mutation per 100 amino acids). An element of this matrix, [Mij], gives the probability that the amino acid in column j will be replaced by the amino acid in row i after a given evolutionary interval, in this case 1 PAM. Thus, there is a 0.56% probability that Asp will be replaced by Glu. To simplify the appearance, the elements are shown multiplied by 10,000. (Adapted from Figure 82, Atlas of Protein Sequence and Structure, Suppl 3, 1978, M.O. Dayhoff, ed., National Biomedical Research Foundation, 1979.)
PAM 1 mutation matrix

PAM 100 matrix as used in Clustal
C 14,
S -1, 6,
T -5, 2, 7,
P -6, 1, -1, 10,
A -5, 2, 2, 1, 6,
G -8, 1, -3, -3, 1, 8,
N -8, 2, 0, -3, -1, -1, 7,
D -11, -1, -2, -4, -1, -1, 4, 8,
E -11, -2, -3, -3, 0, -2, 1, 5, 8,
Q -11, -3, -3, -1, -2, -5, -1, 1, 4, 9,
H -6, -4, -5, -2, -5, -7, 2, -1, -2, 4, 11,
R -6, -1, -4, -2, -5, -8, -3, -6, -5, 1, 1, 10,
K -11, -2, -1, -4, -4, -5, 1, -2, -2, -1, -3, 3, 8,
M -11, -4, -2, -6, -3, -8, -5, -8, -6, -2, -7, -2, 1, 13,
I -5, -4, -1, -6, -3, -7, -4, -6, -5, -5, -7, -4, -4, 2, 9,
L -12, -7, -5, -5, -5, -8, -6, -9, -7, -3, -5, -7, -6, 4, 2, 9,
V -4, -4, -1, -4, 0, -4, -5, -6, -5, -5, -6, -6, -6, 1, 5, 1, 8,
F -10, -5, -6, -9, -7, -8, -6,-11,-11,-10, -4, -7,-11, -2, 0, 0, -5, 12,
Y -2, -6, -6,-11, -6,-11, -3, -9, -7, -9, -1,-10,-10, -8, -4, -5, -6, 6, 13,
W -13, -4,-10,-11,-11,-13, -8,-13,-14,-11, -7, 1, -9,-11,-12, -7,-14, -2, -2, 19,
  C S T P A G N D E Q H R K M I L V F Y W

PAM 250 matrix as used in Clustal
C 12,
S 0, 2,
T -2, 1, 3,
P -3, 1, 0, 6,
A -2, 1, 1, 1, 2,
G -3, 1, 0,-1, 1, 5,
N -4, 1, 0,-1, 0, 0, 2,
D -5, 0, 0,-1, 0, 1, 2, 4,
E -5, 0, 0,-1, 0, 0, 1, 3, 4,
Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,
H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,
R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,
K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,
M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,
I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,
L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,
V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,
F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,
Y 0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,
W -8,-2,-5,-6,-6,-7,-4,-7,-7,-5,-3, 2,-3,-4,-5,-2,-6, 0, 0,17,
  C S T P A G N D E Q H R K M I L V F Y W

Matrices often used for the alignment of proteins
PAM 250 (Dayhoff et al., 1978)
BLOSUM62 (Henikoff and Henikoff, 1992)
JTT (Jones et al., 1992)
mtREV24 (Adachi and Hasegawa, 1996)
GONNET matrix (Gonnet et al., 1992)

Multiple alignment of protein sequences
For the construction of reliable phylogenetic trees the quality of a multiple alignment is of the utmost importance.
There are many programs available for the multiple alignment of proteins.
– A good program in the public domain is: ClustalW
– A similar program is Pileup of the GCG package
They quickly align sequence pairs and roughly determine the degree of identity of each pair.
Then the sequences are aligned more precisely in a progressive way, starting with the two closest sequences.
Most programs work best when the sequences have similar length.

Some rules of thumb for the manual alignment of proteins (1)
An automatically produced multiple alignment often needs manual adjustment to improve the quality of the alignment.
Such improvement can be obtained by using all the knowledge that is available about a protein.
If a structure is available you should use the detailed information about secondary structure for the alignment.

The rules for mutation of amino acids are dependent on their physicochemical properties.
Surface residues (DRENK) are preferably mutated to residues of similar properties. Since they are not, or less, involved in protein folding they mutate rather easily.
Hydrophobic residues (FAMILYVW) are preferentially replaced by other hydrophobic ones. These residues are mainly internal and determine the folding of the protein. They thus mutate rather slowly.
Some rules of thumb for the manual alignment of proteins (2)

The residues CHQST are indifferent and may be replaced with any other type of residue.
The residues (DRENKCHQST), when conserved throughout the alignment, are very likely residues that are involved in the active site. So the multiple alignment should be adjusted accordingly.
Periodicity of charged residues may provide information as to the presence of elements of secondary structure such as α-helices and β-strands.
Some rules of thumb for the manual alignment of proteins (3)
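The residue classes used in these rules of thumb can be turned into a small helper for screening an alignment. The sketch below is illustrative only: the function names and the toy alignment are hypothetical, and "conserved" is taken to mean "identical in every sequence", which is a simplifying assumption rather than something the slides define.

```python
SURFACE = set("DRENK")         # charged/polar, mostly at the surface
HYDROPHOBIC = set("FAMILYVW")  # mostly internal
INDIFFERENT = set("CHQST")     # may be replaced by any other type


def residue_class(aa):
    """Classify one residue according to the three groups on the slides."""
    if aa in SURFACE:
        return "surface"
    if aa in HYDROPHOBIC:
        return "hydrophobic"
    if aa in INDIFFERENT:
        return "indifferent"
    return "other"  # Gly, Pro, gaps and ambiguity codes are not in any group


def candidate_active_site_columns(alignment):
    """Indices of columns holding one identical DRENKCHQST residue throughout."""
    polar = SURFACE | INDIFFERENT
    return [
        index
        for index, column in enumerate(zip(*alignment))
        if len(set(column)) == 1 and column[0] in polar
    ]


# Hypothetical toy alignment, one string per aligned sequence.
toy_alignment = ["ADHKL", "VDHRL", "IDHKL"]
print(candidate_active_site_columns(toy_alignment))
```

In the toy alignment the conserved Asp and His columns are reported, while the conserved Leu column is ignored because Leu belongs to the hydrophobic group.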

[Figure: α-helix]

[Figure: β-strand]

Indels (insertions/deletions) are generally not found in elements of secondary structure but in loops.
Pro and Gly interfere with secondary structure elements and thus have a preference for loops.
Hydrophobicity (or hydropathy) profiles according to Kyte and Doolittle of two homologous proteins are in general strikingly similar.
Some rules of thumb for the manual alignment of proteins (4)

Proline interferes with α-helix and β-sheet formation
[Figure, from Deber and Therien, 2002]

Possible functions for proline in trans-membrane domains
[Figure, from Deber and Therien, 2002]

Alignment of malate dehydrogenase sequences

lcl|CHR34_tmp.0150 ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lcl|CHR34_tmp.0140 ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lcl|CHR34_tmp.0130 MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lcl|CHR28_tmp.0050 -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL
. . :*.: *.:. :* .* : . : : *

lcl|CHR34_tmp.0150 -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lcl|CHR34_tmp.0140 -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lcl|CHR34_tmp.0130 -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lcl|CHR28_tmp.0050 AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR
*: .:*.. . . .. : :: . . ::. :: *

lcl|CHR34_tmp.0150 TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lcl|CHR34_tmp.0140 IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lcl|CHR34_tmp.0130 TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lcl|CHR28_tmp.0050 IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL
.. *:. .. : :: .* *: . . : : * :* :: .: *: *:

lcl|CHR34_tmp.0150 KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lcl|CHR34_tmp.0140 TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lcl|CHR34_tmp.0130 KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lcl|CHR28_tmp.0050 SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV
.::. : :: * . * * : : : .

lcl|CHR34_tmp.0150 --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lcl|CHR34_tmp.0140 --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lcl|CHR34_tmp.0130 AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lcl|CHR28_tmp.0050 QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL
. . . . * . : : * : :. .

lcl|CHR34_tmp.0150 FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lcl|CHR34_tmp.0140 FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lcl|CHR34_tmp.0130 FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lcl|CHR28_tmp.0050 IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------
:*. . * : . .: : : * : *: :

Hydrophobicity profiles
Profiles according to Kyte and Doolittle of homologous proteins are in general strikingly similar and may provide a tool in the alignment of two or more proteins.
The two phosphoglycerate kinase sequences below share 50% identical residues.
[Figure: hydrophobicity profiles of Trypanosoma congolense PGK and Euglena gracilis PGK]
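A hydropathy profile of the kind shown here is a sliding-window average over per-residue hydropathy values. The sketch below uses the published Kyte and Doolittle scale; the function name `hydropathy_profile`, the window of 9 residues and the example fragment are choices made for illustration and are not taken from the slides.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every full window along an ungapped sequence."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# N-terminal fragment of the first malate dehydrogenase sequence, gaps removed.
fragment = "MKPSTLSRFKVTVLGASGAIGQPLALALVQNKRVSEL"
for position, score in enumerate(hydropathy_profile(fragment), start=1):
    print(f"{position:3d} {score:6.2f}")
```

Plotting the profiles of two homologous proteins on the same axis makes shifted peaks visible, which can hint at where an alignment needs a gap.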

Tree construction methods (1)
Distance matrix methods
– Cluster analysis (UPGMA, WPGMA, etc.)
– Fitch & Margoliash (1967)
– Transformed distance methods (e.g. Li, 1981)
– Neighbor-joining (Saitou & Nei, 1987)
– ... many more
Parsimony methods
– Maximum parsimony
Other methods
– Maximum likelihood (Felsenstein, 1981)
– ... many more

Character-based methods:
– maximum parsimony
– maximum likelihood
Non-character-based methods:
– distance matrix methods
Tree construction methods (2)

Phylogeny (2)
Distance matrix methods (in the public domain)
– Least squares method (Fitch and Margoliash)
    – Fitch, Kitsch of the Phylip package (Joe Felsenstein, Univ. Washington)
– Neighbor-joining method
    – Neighbor of the Phylip package (Joe Felsenstein, Univ. Washington)
    – Clustal, or Distnj in the Protml package (Adachi and Hasegawa, Univ. Tokyo)
    – Darwin (Gaston Gonnet, ETH, Zurich, via mailserver or WWW)
Protein maximum likelihood (in the public domain)
– Protml (Adachi and Hasegawa, Univ. Tokyo) (very CPU intensive)
– TreePuzzle (Strimmer and von Haeseler, 1997)
Protein maximum parsimony (in the public domain)
– Protpars (Joe Felsenstein, Univ. Washington)
– Paup (David Swofford, latest version will be commercial)

Some useful information about phylogenetic trees
[Figure: rooted tree with nodes A to I; the root, the external nodes and the internal nodes are labelled]
A-E are external nodes (extant). F-I are internal (ancestral) nodes.
OTUs are operational taxonomic units. They can be: species, populations, individuals, genes, proteins. They are the extant (existing) or extinct (ancestral) OTUs.
Topology: order of the nodes on the tree

Distance Matrix Methods
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate.)
Transformed distance methods. Corrections may be introduced to obtain trees with true evolutionary distances (PAM values, Kimura), or corrections are carried out with reference to an outgroup (Farris, 1971; Klotz et al., 1979). Should be used when evolutionarily distant organisms are included in the dataset.
Neighbors relation methods
– FITCH (Fitch, 1981)
– Neighbor-Joining method (Saitou and Nei, 1987)
Should all be used with corrected (see above) distance matrices.

Distance matrix

Uncorrected for Multiple Substitutions
      1      2      3      4      5
   0.00   0.63   0.63  22.88  18.50   AC007866_13 1
          0.00   0.63  22.57  18.50   AC007866_17 2
                 0.00  22.88  17.87   AC007866_15 3
                        0.00   5.64   AC007866_9 4
                               0.00   AC007866_11 5

Using the Kimura correction method
Gap weighting is 0.000000
      1      2      3      4      5
   0.00   0.63   0.63  27.35  21.29   AC007866_13 1
          0.00   0.63  26.90  21.29   AC007866_17 2
                 0.00  27.35  20.47   AC007866_15 3
                        0.00   5.88   AC007866_9 4
                               0.00   AC007866_11 5

Distance matrix as produced by the EMBOSS program distmat
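The corrected values in the second table are numerically consistent, to within rounding, with Kimura's empirical correction for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing residues. The sketch below assumes that this is the formula that was applied; the function name is made up for the example.

```python
import math


def kimura_protein_distance(p):
    """Kimura's empirical correction; p is the observed fraction of differences."""
    inside = 1.0 - p - 0.2 * p * p
    if inside <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inside)


# Uncorrected values from the first table, given per 100 residues.
for observed in (0.63, 5.64, 17.87, 22.57, 22.88):
    corrected = 100 * kimura_protein_distance(observed / 100)
    print(f"{observed:6.2f} -> {corrected:6.2f}")
```

Small distances are hardly changed, while the correction grows with divergence because multiple substitutions at the same site become more likely: the uncorrected 22.88 maps to about 27.35, as in the table.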

UPGMA
UPGMA (Unweighted Pair Group Method with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate.)

Tree construction (UPGMA)Tree construction (UPGMA)
First cycle
A B C D E
B 2 C 4 4 D 6 6 6 E 6 6 6 4 F 8 8 8 8 8
Cluster the pair of OTUs with the smallest distance, being A and B, The branching point is positioned at a distance of 2 / 2 = 1 substitution.
Tree construction (UPGMA)
Following the first clustering, A and B are considered as a single composite OTU (A,B), and we now calculate the new distance matrix as follows:
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8
In other words, the distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU. A new distance matrix is then built from the newly calculated distances and the whole cycle is repeated:
Tree construction (UPGMA)
Second cycle
      A,B   C    D    E
C      4
D      6    6
E      6    6    4
F      8    8    8    8
Tree construction (UPGMA)
Third cycle
      A,B   C    D,E
C      4
D,E    6    6
F      8    8    8
Tree construction (UPGMA)
Fourth cycle
      AB,C  D,E
D,E    6
F      8    8
Tree construction (UPGMA)
Fifth cycle
      ABC,DE
F      8
The final step consists of clustering the last OTU, F, with the composite OTU.
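The five cycles above can be reproduced with a short Python sketch. It is a minimal illustration, not the code of any particular package: the function name and the lower-triangular input format are assumptions made for this example. Ties between equally close pairs are broken by taking the first pair encountered, so the order of the two clusterings at distance 4 differs from the slides, but the branching heights and the resulting tree are the same.

```python
from itertools import combinations

def upgma(names, lower_rows):
    """Cluster OTUs by UPGMA.

    names      -- OTU labels, e.g. ["A", "B", ...]
    lower_rows -- lower-triangular distances; lower_rows[i] holds the
                  distances from names[i + 1] to names[0..i]
    Returns a list of (cluster_label, branching_height) in merge order.
    """
    members = {name: 1 for name in names}   # cluster label -> number of simple OTUs
    d = {}
    for i, row in enumerate(lower_rows):
        for j, value in enumerate(row):
            d[frozenset((names[i + 1], names[j]))] = float(value)

    merges = []
    while len(members) > 1:
        # Closest pair of current clusters (ties: first pair encountered).
        a, b = min(combinations(members, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2.0
        new = "(" + a + "," + b + ")"
        for c in members:
            if c not in (a, b):
                # Size-weighted mean, i.e. the plain average over all
                # pairs of simple OTUs in the two clusters.
                d[frozenset((new, c))] = (
                    members[a] * d[frozenset((a, c))]
                    + members[b] * d[frozenset((b, c))]
                ) / (members[a] + members[b])
        size = members.pop(a) + members.pop(b)
        members[new] = size
        merges.append((new, height))
    return merges

names = ["A", "B", "C", "D", "E", "F"]
rows = [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
merges = upgma(names, rows)
heights = [h for _, h in merges]
for label, h in merges:
    print(label, h)
```

Each height is half the distance at which the two clusters were joined, as in the first cycle where A and B branch at 2 / 2 = 1.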
Pitfalls of UPGMA
The UPGMA clustering method is very sensitive to unequal evolutionary rates.
Clustering works only if the data are ultrametric.
Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
The three-point condition
For any three taxa: dist AC <= max (distAB, distBC) or, in words: the two greatest distances are equal.
UPGMA assumes that the evolutionary rate is the same for all branches.
If the assumption of rate constancy among lineages does not hold, UPGMA may give an erroneous topology.
Non-ultrametric tree
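Whether a distance matrix satisfies the three-point condition can be checked directly. The sketch below is an illustration only; the function names and the second, non-ultrametric matrix are hypothetical and were made up for this example, while the first matrix is the one used in the UPGMA example.

```python
from itertools import combinations

def from_lower_rows(names, rows):
    """Build a pairwise distance lookup from lower-triangular rows."""
    return {frozenset((names[i + 1], names[j])): v
            for i, row in enumerate(rows) for j, v in enumerate(row)}

def is_ultrametric(names, dist, tol=1e-9):
    """True if every triple of taxa satisfies the three-point condition."""
    for a, b, c in combinations(names, 3):
        d = sorted((dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]))
        if d[2] - d[1] > tol:   # the two greatest distances must be equal
            return False
    return True

names = list("ABCDEF")
upgma_example = from_lower_rows(
    names, [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])
# Hypothetical distances for three taxa evolving at unequal rates.
unequal_rates = from_lower_rows(list("ABC"), [[2], [4, 6]])

print(is_ultrametric(names, upgma_example))
print(is_ultrametric(list("ABC"), unequal_rates))
```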
Unequal rates of mutation lead to wrong trees
UPGMA tree construction based on the data of the left tree would result in the erroneous tree at the right.
UPGMA (conclusion)
UPGMA uses real (uncorrected) distance values and a sequential clustering algorithm.
This method of tree construction is very sensitive to differences in branch length or unequal rates of evolution.
It should only be used with closely related OTUs, or when there is constancy of evolutionary rate.
The method is often used in combination with isoenzyme or restriction site data or with morphological criteria.
Maximum Parsimony Methods
Use sequence information rather than distance information.
Evaluate all possible trees and find the one that requires the minimum number of substitutions over the informative sites.
Maximum Parsimony analysis (2)
Parsimony implies that simpler hypotheses are preferable to more complicated ones.
Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data, or in other words by minimizing the total tree length.
The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.
Maximum Parsimony analysis (3)
Maximum parsimony, when applied to protein sequence data, either considers each site of the sequence as a multistate unordered character with 20 possible states (the amino acids) (Eck and Dayhoff, 1966), or takes into account the genetic code and the number of mutations (1, 2 or 3) required to explain an observed amino-acid substitution. The latter method is implemented in the PROTPARS program (Felsenstein, 1993).
The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. However, the number of unrooted trees that have to be analysed increases rapidly with the number of OTUs.
Maximum Parsimony analysis (4)
The number of rooted trees (Nr) for n OTUs is given by: $N_r = (2n-3)! / (2^{n-2}\,(n-2)!)$
The number of unrooted trees (Nu) for n OTUs is given by: $N_u = (2n-5)! / (2^{n-3}\,(n-3)!)$
Number of OTUs    Unrooted trees    Rooted trees
 2                1                 1
 3                1                 3
 4                3                 15
 5                15                105
 6                105               945
 7                945               10,395
 8                10,395            135,135
 9                135,135           2,027,025
10                2,027,025         34,459,425
15                7.91E12           2.13E14
20                2.22E20           8.20E21
This rapid increase in number of trees to be analysed may make it impossible to apply the method to very large datasets. In that case the parsimony method may become very time consuming, even on very fast computers.
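The two formulas are easy to evaluate exactly with integer arithmetic. The following Python sketch is illustrative only (the function names are made up here); it treats the trivial case of two OTUs separately, since the unrooted formula is only defined for three or more OTUs.

```python
from math import factorial

def n_rooted(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def n_unrooted(n):
    """Number of unrooted bifurcating trees for n >= 2 OTUs."""
    if n < 3:
        return 1   # two OTUs can only be joined in one way
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(n, n_unrooted(n), n_rooted(n))
```

Note that the number of unrooted trees for n + 1 OTUs equals the number of rooted trees for n OTUs, because a root behaves like one additional OTU.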
Maximum parsimony method for 4 nucleic-acid sequences
            Site
Sequence    1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G
For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the ancestral sequences and by counting the number of mutations required to explain the respective trees :
Number of mutations

(1) AAGAGTGCA             AGATATCCA (3)
             \4         2/
              \    4    /
      AGCCGTGCG --- AGAGATCCG            Tree I: 10
              /         \
             /0         0\
(2) AGCCGTGCG             AGAGATCCG (4)

(1) AAGAGTGCA             AGCCGTGCG (2)
             \1         3/
              \    5    /
      AGGAGTGCA --- AGAGGTCCG            Tree II: 14
              /         \
             /4         1\
(3) AGATATCCA             AGAGATCCG (4)

(1) AAGAGTGCA             AGCCGTGCG (2)
             \1         3/
              \    5    /
      AGAAGTGCA --- AGATGTCCG            Tree III: 16
              /         \
             /5         2\
(4) AGAGATCCG             AGATATCCA (3)
Tree I has the topology with the least number of mutations and thus is the most parsimonious tree.
Ancestral trees are calculated
This analysis includes both informative and non-informative sites in the sequence.
When only informative sites are included, a much smaller number of sites has to be analysed, which for large datasets means a considerable gain in CPU time.
Informative sites
A site is informative only when there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study.
            Site
Sequence    1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G
                    *   *   *
Informative sites are indicated by an asterisk (*).
Informative sites only
1 GGA
2 GGG
3 ACA
4 ACG
  ***

Number of mutations

(1) GGA         ACA (3)
       \1     1/
        \  2  /
      GGG --- ACG            Tree I: 4
        /     \
       /0     0\
(2) GGG         ACG (4)

(1) GGA         GGG (2)
       \1     1/
        \  1  /
      GCA --- GCG            Tree II: 5
        /     \
       /1     1\
(3) ACA         ACG (4)

(1) GGA         GGG (2)
       \2     1/
        \  0  /
      GCG --- GCG            Tree III: 6
        /     \
       /1     2\
(4) ACG         ACA (3)

To infer a maximum parsimony tree, for each possible tree we calculate the minimum number of substitutions at each informative site. In the above example, for sites 5, 7, and 9, tree I requires in total 4 changes, tree II requires 5 changes, and tree III requires 6 changes. In the final step, we sum the number of changes over all the informative sites for each tree and choose the tree associated with the smallest number of substitutions. In our case, tree I is chosen because it requires the smallest number of changes (4) at the informative sites.
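The selection of informative sites and the counting of changes can be sketched in Python for the four sequences of this example. The sketch counts the minimum number of changes per site with Fitch's set-intersection rule, which is one standard way to obtain these minima and is named here as an assumption, since the slides do not say how the counts were obtained. All function names are illustrative, and the code handles only the four-OTU case.

```python
def informative_sites(seqs):
    """Return the 1-based positions of the parsimony-informative sites."""
    sites = []
    for i, column in enumerate(zip(*seqs), start=1):
        counts = {}
        for base in column:
            counts[base] = counts.get(base, 0) + 1
        # Informative: at least two states, each present in >= 2 sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(i)
    return sites

def fitch_steps(left, right):
    """Minimum changes at one site on the four-OTU tree ((a,b),(c,d))."""
    steps = 0
    node_sets = []
    for x, y in (left, right):
        states = {x} & {y}
        if not states:          # the two tips differ: one change needed
            states = {x, y}
            steps += 1
        node_sets.append(states)
    if not (node_sets[0] & node_sets[1]):
        steps += 1              # change along the internal branch
    return steps

def tree_length(seqs, left, right, sites):
    """Sum of minimum changes over the given sites for one topology."""
    total = 0
    for site in sites:
        col = [s[site - 1] for s in seqs]
        total += fitch_steps((col[left[0]], col[left[1]]),
                             (col[right[0]], col[right[1]]))
    return total

seqs = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
sites = informative_sites(seqs)
# Topologies given as pairs of 0-based sequence indices.
trees = {"I": ((0, 1), (2, 3)), "II": ((0, 2), (1, 3)), "III": ((0, 3), (1, 2))}
lengths = {name: tree_length(seqs, l, r, sites) for name, (l, r) in trees.items()}
print(sites, lengths)
```

A non-informative site adds the same number of changes to every topology, which is why leaving such sites out does not alter the ranking of the trees.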
How to find the best tree?
Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree may be found. To guarantee finding the best possible tree, an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.
Branch and bound is a variation on maximum parsimony that guarantees to find the minimal tree without having to evaluate all possible trees. In this way a larger number of taxa can be evaluated, but the method is still limited.
Heuristic searches use step-wise addition and rearrangement (branch swapping) of OTUs. Here it is not guaranteed that the best tree is found.
Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other search for the best tree, it is advised to change the order of the taxa in the dataset and to repeat the analysis, or to instruct the program to do this for you by providing a so-called jumble factor.
Consensus tree
Since the Maximum Parsimony method may result in more than one equally parsimonious tree, a consensus tree should be created. For the creation of a consensus tree see bootstrapping.
Parsimony and branch lengths

(1) G         A (3)
     \1     0/
      \  1  /
       C---A
      /     \
     /0     1\
(2) C         T (4)

(1) G         A (3)
     \0     1/
      \  1  /
       G---T
      /     \
     /1     0\
(2) C         T (4)

(1) G         A (3)
     \1     1/
      \  1  /
       C---A
      /     \
     /0     0\
(2) C         A (4)
3 possible trees for 4 OTUs, all describe the same final state by assuming a total of 3 steps.
Each final state is arrived at via a different route.
Each of the three trees is equally valid, but the number of steps along the individual branches (or the length of each branch) is not determined.
For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.
Some final notes on maximum parsimony
Maximum Parsimony (positive points):
– is based on shared and derived characters. It therefore is a cladistic rather than a phenetic method
– does not reduce sequence information to a single number
– tries to provide information on the ancestral sequences
– evaluates different trees
Maximum Parsimony (negative points):
– does not assume an evolutionary model
– is slow in comparison with distance methods
– does not use all the sequence information (only informative sites are used)
– does not correct for multiple mutations (does not imply a model of evolution)
– does not provide information on the branch lengths
– is notorious for its sensitivity to codon bias
How to root an unrooted tree?
The majority of methods yield unrooted trees.
To root a tree one should add an outgroup to the dataset. An outgroup is an OTU for which external information (e.g. paleontological information) is available that indicates that the outgroup branched off before all other taxa.
Do not choose an outgroup that is very distantly related to your taxa. This may result in serious topological errors.
Do not choose an outgroup that is too closely related to the taxa in question either. In this case it may not be a true outgroup.
The use of more than one outgroup generally improves the estimate of tree topology.
In the absence of a good outgroup the root may be positioned by assuming approximately equal evolutionary rates over all the branches. In this way the root is put at the midpoint of the longest pathway between two OTUs.
Maximum likelihood
The method evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. A history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood.
The following programs are available from the web:
– DNAML (DNA data only. By Joe Felsenstein in the Phylip package)
– FastDNAML (DNA data only. A faster algorithm applied by Gary Olsen to Joe Felsenstein's DNAML program)
– ProtML (DNA and protein. By Adachi and Hasegawa, 1992)
– TreePuzzle (DNA and protein. By Strimmer and von Haeseler, 1995). This program applies a heuristic method and is much faster than PROTML, but does not guarantee to find the best tree.
Advantages and disadvantages of the maximum likelihood method
There are some supposed advantages of maximum likelihood methods over other methods:
– it is the estimation method least affected by sampling error
– it is robust to many violations of the assumptions in the evolutionary model
– with very short sequences it tends to outperform alternative methods such as parsimony or distance methods
– the method is statistically well founded
– it evaluates different tree topologies
– it uses all the sequence information
There are also some supposed disadvantages:
– maximum likelihood is very CPU intensive and thus extremely slow
– the result is dependent on the model of evolution used
Explication of the method
Maximum likelihood evaluates the probability that the chosen evolutionary model will have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood. Assume that we have the aligned nucleotide sequences for four taxa:
      1         j           N
(1)   A G G C U C C A A ....A
(2)   A G G U U C G A A ....A
(3)   A G C C C A G A A ....A
(4)   A U U U C G G A A ....C
and we want to evaluate the likelihood of the unrooted tree shown below for the nucleotides at site j of the sequences:
(1)      (2)
  \      /
   \    /
    ----
   /    \
  /      \
(3)      (4)
What is the probability that this tree would have generated the data presented in the sequence under the chosen model?
Likelihood for one site
The models are time-reversible, therefore the likelihood of the tree is independent of the position of the root. Thus it is convenient to root the tree at an arbitrary internal node.
[Figure: the tree for site j rooted at internal node (6); the tips C and C join at node (5), which joins the tips A and G at node (6)]
Assume that nucleotide sites evolve independently (the Markovian model of evolution). Then we can calculate the likelihood for each site separately and combine these into the total likelihood.
For the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. So the likelihood for a particular site is the summation of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case all possible nucleotides A, G, C, and T can occupy nodes (5) and (6), giving 4 x 4 = 16 possibilities:
$L(j) = \sum_{x_5} \sum_{x_6} \mathrm{Prob}(C, C, A, G, x_5, x_6)$, where $x_5$ and $x_6$ run over the nucleotides at nodes (5) and (6).
In the case of protein sequences each site may occupy 20 states (the 20 amino acids) and thus 400 possibilities have to be considered. Since any one of these scenarios could have led to the amino-acid configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.
Likelihood for the full tree
The likelihood for the full tree then is the product of the likelihoods at each site:
$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$
Since the individual likelihoods are extremely small numbers it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood:
$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$
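The site likelihood and the log-likelihood sum can be illustrated with a small Python sketch for the four-taxon tree. Several things here are assumptions made only for the illustration and are not taken from the slides: the Jukes-Cantor substitution model with equal base frequencies, a common branch length t = 0.1 substitutions per site on every branch, and all function names. Only the first nine sites of the example alignment are used.

```python
import math
from itertools import product

STATES = "ACGU"

def p_jc(x, y, t):
    """Jukes-Cantor probability of state y after branch length t, starting from x."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one site; tips = states of OTUs (1), (2), (3), (4).

    Tips (1) and (2) hang from node (5); node (5) and tips (3) and (4)
    hang from the root, node (6). Every branch has the assumed length t.
    """
    t1, t2, t3, t4 = tips
    total = 0.0
    for x5, x6 in product(STATES, repeat=2):    # 4 x 4 = 16 reconstructions
        total += (0.25                           # equal base frequencies at the root
                  * p_jc(x6, x5, t)
                  * p_jc(x5, t1, t) * p_jc(x5, t2, t)
                  * p_jc(x6, t3, t) * p_jc(x6, t4, t))
    return total

seqs = ["AGGCUCCAA", "AGGUUCGAA", "AGCCCAGAA", "AUUUCGGAA"]
columns = ["".join(col) for col in zip(*seqs)]
L_j = site_likelihood(columns[5])                # site j: tips C, C, A, G
log_L = sum(math.log(site_likelihood(col)) for col in columns)
print(L_j, log_L)
```

A full maximum likelihood program would in addition optimise the branch lengths for each topology instead of fixing them.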
The model of evolution
The PROTML program in the MOLPHY package (Adachi and Hasegawa, 1992), as well as the TreePUZZLE program by Strimmer and von Haeseler (1995), have implemented an instantaneous rate matrix derived from the Dayhoff empirical substitution matrix. This has been called the Dayhoff model.
Recently a model called the JTT model of evolution, based upon the updated empirical substitution matrix of Jones et al. (1992), has been developed and implemented in these programs.
The maximum likelihood tree
The above procedure is then repeated for all possible topologies (or for all possible trees).
The tree with the highest probability is the tree with the highest maximum likelihood.
Bootstrapping
Bootstrapping is a way of testing the reliability of the dataset. It is the creation of pseudoreplicate datasets by resampling. Bootstrapping allows you to assess whether the distribution of characters has been influenced by stochastic effects. In phylogenetic analyses nonparametric bootstrapping is the most commonly used method. The pseudoreplicate datasets are generated by randomly sampling the original character matrix to create new matrices of the same size as the original. The frequency with which a given branch is found is recorded as the bootstrap proportion. These proportions can be used as a measure of the reliability (within limitations) of individual branches in the optimal tree.
Thus bootstrap analysis:
– is a statistical method for obtaining an estimate of error
– is used to evaluate the reliability of a tree
– is used to examine how often a particular cluster in a tree appears when nucleotides or amino acids are resampled
NB: If the entire dataset is compatible and has not been biased by stochastic effects, all bootstrap trees should in principle have the same topology!
The practice of bootstrapping and the construction of a consensus tree
Take a dataset consisting of in total n sequences with m sites each (see below). A number of resampled datasets of the same size (n x m) as the original dataset is produced. Each site is sampled at random, with replacement, so that some sites appear several times and others not at all; no more sites are sampled than there were original sites. In order to be statistically significant the number of datasets should be high, and equal to or higher than the number of individual sites present in the dataset.
Our example dataset consists of in total 4 sequences with 10 sites each (see below). When three new datasets are prepared by random sampling of sites, the following three sample sets of data can be obtained:
Sample 1
Number of times each site is sampled:
     0 1 2 0 3 0 1 2 0 1
     Original               Resampled
A    A G G C U C C A A A    G G G U U U C A A A
B    A G G U U C G A A A    G G G U U U G A A A
C    A G C C C C G A A A    G C C C C C G A A A
D    A U U U C C G A A C    U U U C C C G A A C
Distance matrix:
     A  B  C
B    1
C    6  5
D    8  7  4
Sample 2
Number of times each site is sampled:
     1 0 0 0 2 2 2 0 0 3
     Original               Resampled
A    A G G C U C C A A A    A U U C C C C A A A
B    A G G U U C G A A A    A U U C C G G A A A
C    A G C C C C G A A A    A C C C C G G A A A
D    A U U U C C G A A C    A C C C C G G C C C
Distance matrix:
     A  B  C
B    2
C    4  2
D    7  5  3
Sample 3
Number of times each site is sampled:
     1 0 0 0 2 2 2 0 0 3
     Original               Resampled
A    A G G C U C C A A A    A U U C C C C A A A
B    A G G U U C G A A A    A U U C C G G A A A
C    A G C C C C G A A A    A C C C C G G A A A
D    A U U U C C G A A C    A C C C C G G C C C
Distance matrix:
     A  B  C
B    1
C    3  2
D    6  3  4
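The resampling step can be sketched in Python as follows. The code is an illustration under simple assumptions: sites are drawn uniformly with replacement, distances are plain counts of differing sites, and all function names are made up for this example. Feeding in the sampling counts listed for Sample 1 reproduces that resampled dataset and its distances.

```python
import random

alignment = {
    "A": "AGGCUCCAAA",
    "B": "AGGUUCGAAA",
    "C": "AGCCCCGAAA",
    "D": "AUUUCCGAAC",
}

def random_columns(n_sites, rng):
    """Draw n_sites column indices at random, with replacement."""
    return sorted(rng.randrange(n_sites) for _ in range(n_sites))

def counts_to_columns(counts):
    """Turn 'number of times each site is sampled' into column indices."""
    return [i for i, c in enumerate(counts) for _ in range(c)]

def apply_columns(aln, cols):
    """Build the pseudoreplicate alignment from the chosen columns."""
    return {name: "".join(seq[i] for i in cols) for name, seq in aln.items()}

def n_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Reproduce Sample 1 from its listed sampling counts.
sample1 = apply_columns(alignment, counts_to_columns([0, 1, 2, 0, 3, 0, 1, 2, 0, 1]))
print(sample1)

# A fresh pseudoreplicate; the seed value is arbitrary.
replicate = apply_columns(alignment, random_columns(10, random.Random(1)))
print(replicate)
```

In a real analysis a tree is built from every pseudoreplicate and the trees are then summarised in a consensus tree.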
Consensus tree
A large number of datasets (between one hundred and one thousand, depending on computer power) and the same number of trees are generated in this way. In this specific case taxa A and B form a cluster in all three trees, while C clusters with D in only one tree. There exist specialised programs, such as the program Consense in the Phylip package of Joe Felsenstein, that are able to analyse all the resulting trees and prepare the most likely tree or consensus tree from those data.
The resulting consensus tree for our small dataset is shown below. The number of times each branch point or node occurred (the so-called bootstrap proportion) is indicated at each node.
Result
A    A G G C U C C A A A
B    A G G U U C G A A A
C    A G C C C C G A A A
D    A U U U C C G A A C
Distance matrix:
     A  B  C
B    2
C    3  3
D    6  4  4
Again some good advice (1)
Tree topologies may strongly depend on the following:
– DNA or Protein used in the analysis
– Distance or Parsimony methods applied
– The number of OTUs included in the alignment
– The order of the OTUs in the alignment
– The selection of a good outgroup
None of the methods may guarantee the one tree with the correct topology.
Again some good advice (2)
So as to have an idea of the reliability of the topology of the resulting tree, one should do one or all of the following:
– Apply more than one method (distance, parsimony) to the dataset.
– Vary the parameters used by the different programs, such as seed value and jumble factor for the order of OTU addition.
– Add or remove one or more OTUs and see how this influences tree topology.
– Try to include an outgroup that may serve as a root for your tree.
– Apply bootstrap or jackknife analyses to your dataset and prepare a consensus tree of 100 - 1000 replicas (depending on the size of the dataset and on computer power).
Only when widely different methods provide you with similar or identical tree topologies, and such topologies are supported by good bootstrap values (> 95%), can the trees be considered reliable.
Limitations of the various methods
Distance approaches (UPGMA, corrected distances and neighbor-joining) do not use the original (sequence) data, but derived distance information. Some information is said to be lost.
Character-state approaches (Maximum Parsimony) are said to be more powerful than distance methods because they use the raw data. However, this is usually a small fraction of the data. Maximum parsimony uses only the informative sites. So when the number of informative sites is not large, this method is often less efficient than distance methods (Saitou and Nei, 1986). Maximum parsimony is notorious for its sensitivity to codon bias.
None of the methods is reliable when OTUs with highly unequal evolutionary separation are included in the dataset.
Some terms used in molecular evolution
Indel: position in a sequence alignment where one of the sequences has acquired an insertion or extension or has undergone a deletion.
Identity: percentage of identical residues in pairwise aligned sequences. Normally deletions or insertions are not taken into consideration, since it is not possible to tell how many events have been at the basis of the creation of such an indel.
Homology: two sequences are homologous or have homology when they have evolved from a common ancestral sequence. The same holds for the aligned residues in a sequence alignment. Homologous residues are derived from a common ancestral residue.
Similarity and homology as a percentage should not be used. Two sequences can be similar, and have a certain percentage of identity, but cannot have a certain percentage of similarity. The same holds for homology.
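The definition of identity given above, with indel positions left out of the count, can be sketched in Python. The function name is hypothetical, and the two aligned sequences are the hypothetical sequences A and B used elsewhere in this presentation for the distance-matrix example.

```python
def percent_identity(a, b, gap="-"):
    """Percentage of identical residues over the ungapped aligned positions."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue            # indel positions are not taken into consideration
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0

seq_a = "MOLECULAR--EVOLUTION"
seq_b = "MOLEKULARE-EVOLUTIEN"
identity = percent_identity(seq_a, seq_b)
print(round(identity, 1))
```

Here 18 positions are compared after the two gapped columns are skipped, and 16 of them are identical.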
Some PAM rates
Protein                        PAMs per 100 million years
IG kappa chain C region        37
Lactalbumin                    27
Epidermal growth factor        26
Haptoglobin alpha chain        20
Serum albumin                  19
Phospholipase A                19
Hemoglobin alpha chain         12
Animal lysozyme                9.8
Myoglobin                      8.9
Amyloid AA                     8.7
Acid proteases                 8.4
Myelin basic protein           7.4
Cytochrome b                   4.5
Lactate dehydrogenase          3.4
Adenylate kinase               3.2
Triosephosphate isomerase      2.8
Cytochrome c                   2.2
Plant ferredoxin               1.9
Glutamate dehydrogenase        0.9
Histone H4                     0.1
(Adapted from Table 1, Atlas of Protein Sequence and Structure, Suppl. 3, 1978, M.O. Dayhoff, ed. National Biomedical Research Foundation, 1979.)
The three letter amino acid code
A  Ala    I  Ile    S  Ser
B  Asx    K  Lys    T  Thr
C  Cys    L  Leu    V  Val
D  Asp    M  Met    W  Trp
E  Glu    N  Asn    X  Xxx
F  Phe    P  Pro    Y  Tyr
G  Gly    Q  Gln    Z  Glx
H  His    R  Arg
Alignment of two protein sequences (1)
Consider four hypothetical sequences:
PHYLOGENY, PHOLOGENY, PHLOGENY, PHOLONY
Alignment can be done in various ways:
PHYLOGENY        PHY-LOGENY
PHOLOGENY   or   PH-OLOGENY
PH-LOGENY        PH--LOGENY
PHOLO--NY        PH-OLO--NY
Tree construction using distance-matrix methods
Phylogenetic tree constructed from 6 aligned sequences
A   MOLECULAR--EVOLUTION
B   MOLEKULARE-EVOLUTIEN
C   MOLECULAIREEVOLUTIEN
D   MO-ECALIAREEFOLUTIE-
E   MO-ESALIARE-GOLUTIU-
F   NO-ASELIAKE-HODATAU-
[Figure: phylogenetic tree of sequences A to F with branch lengths]
Triosephosphate isomerase
[Figure: phylogenetic tree of triosephosphate isomerase sequences (scale bar 0.1), with the groups Animalia, Planta, Protists, Fungi, Eubacteria and Archaebacteria labelled]