Phylogenetic Analysis


Page 1: Phylogenetic Analysis

Phylogenetic Analysis

Page 2: Phylogenetic Analysis

2

Phylogenetic Analysis Overview

• Insight into evolutionary relationships

• Inferring or estimating these evolutionary relationships shown as branches of a tree

• Length and nesting reflects degree of similarity between any two items (in our case, sequences)

Page 3: Phylogenetic Analysis

3

Phylogenetics and Cladistics

• Clade = a set of descendants from a single ancestor (Greek word for branch)

• Three basic assumptions

– Any group of organisms is related by descent from a common ancestor

– There is a bifurcating pattern of cladogenesis

– Change in characteristics occurs in lineages over time

Page 4: Phylogenetic Analysis

4

More default assumptions

1. Correct sequences and origins

2. Shared ancestral origin

3. Homologous sequences

4. No mixtures of nuclear and organellar sequences

5. Large enough taxa sampling size

6. Contains representative sequence variations

7. Sufficient sequence variations

Page 5: Phylogenetic Analysis

5

Basic Terminology

• Clade: a group of organisms or genes that includes the most recent common ancestor of all of its members and all of the descendants of that most recent common ancestor.

• Taxon: any named group of organisms; not necessarily a clade.

• Branches: branch lengths sometimes correspond to the degree of divergence

• Nodes: bifurcating branch points

Branch lengths are not significant

Branch lengths are significant

Page 6: Phylogenetic Analysis

6

Basic Definitions

• Homologous: sequences inferred to share a common ancestor; in practice, sequences that share an arbitrary threshold level of similarity determined by alignment of matching bases

• Similarity: a quantifiable term that refers to a degree of relatedness between sequences, but does not necessarily reflect ancestry.

• Orthologs: homologs produced by speciation; derived from a common ancestor; tend to have similar function

• Paralogs: homologs produced by gene duplication; derived within an organism, tend to have differing functions

• Xenologs: homologs resulting from horizontal gene transfer between two organisms; difficult to verify; variable function but tends to be similar.

Page 7: Phylogenetic Analysis

7

Phylogenetic Analysis Overview

• Objective:

– Determine branch lengths and figure out how the tree should be drawn

– Sequences that are most closely related are drawn as neighboring branches

Page 8: Phylogenetic Analysis

8

Phylogenetic Analysis Overview

• Dependent upon good multiple sequence alignment programs

• Group sequences with similar patterns of substitutions in order to reconstruct a phylogenetic tree

Page 9: Phylogenetic Analysis

9

Phylogenetic Analysis Overview

• Consider two sequences that are related

– Ancestral sequence can be (partially) derived

– With additional sequences, more information can be gathered to add to a correct derivation

Page 10: Phylogenetic Analysis

10

Phylogenetic Analysis Overview

• Example: C-Terminal Motor Kinesin sequences

– http://www.proweb.org/kinesin/BE4_Cterm.html

Page 11: Phylogenetic Analysis

11

Practical use of phylogenetic analysis

• To prioritize the analysis of genes in the target family

– Gives insight into protein functions

Page 12: Phylogenetic Analysis

12

• P. aeruginosa, a bacterium that is one of the top 3 causes of opportunistic infections, is noted for its antimicrobial resistance and resistance to detergents.

• Three homologous outer membrane proteins, OprJ, OprM and OprN, were identified as playing a role in this antimicrobial resistance.

Page 13: Phylogenetic Analysis

13

Figure 14.2 Example of a phylogenetic tree based on genes that does not match organismal phylogeny, suggesting horizontal gene transfer has occurred.

Possible horizontal gene transfer

Page 14: Phylogenetic Analysis

14

Uses of Phylogenetic Analysis

• Given a set of genes, determine which genes are likely to have equivalent functions

• Follow changes occurring in a rapidly changing species such as a virus

– Example: influenza

– Study of rapidly changing genes

– Next year’s strain can be predicted

– Flu vaccination can be developed

Page 15: Phylogenetic Analysis

15

UCMP Glossary: Phylogenetics

Page 16: Phylogenetic Analysis

16

Tree of Life

• Phylogenies describe how the evolution of species has occurred

• Image: http://microbialgenome.org/primer/tree.html

Page 17: Phylogenetic Analysis

17

Tree of Life

• Traditionally, morphological characters (visible features) have been used to classify organisms

– Living organisms

– Fossil records

• Sequence data are beginning to take a larger role

Page 18: Phylogenetic Analysis

18

Tree of Life

• Many different resources including:

– NCBI taxonomy web sites

– University of Arizona’s tree of life project

Page 19: Phylogenetic Analysis

19

NCBI Taxonomy Web Site

• http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/

Taxonomy: a system of classification; the science of classification

Page 20: Phylogenetic Analysis

20

Tree of Life

• http://tolweb.org/tree/phylogeny.html

Page 21: Phylogenetic Analysis

21

Evolutionary Trees

• Two dimensional graph showing evolutionary relationship among a set of items

• Items can be organisms, genes, or sequences

• Each unit is defined by a distinct branch on the tree

Page 22: Phylogenetic Analysis

22

Evolutionary Trees

• leaves represent the units (taxa) being studied

• nodes and branches represent the relationships among the taxa

• Two taxa derived from the same common ancestor will share a node in the graph

Page 23: Phylogenetic Analysis

23

Evolutionary Trees

• length of each branch may be drawn according to the number of sequence level changes that occurred

• distance may not be in direct relation to evolutionary time

• analyses that assume a uniform rate of mutation rely on the molecular clock hypothesis

Page 24: Phylogenetic Analysis

24

Rooted Trees

• One sequence (root) defined to be common ancestor of all of the other sequences

• A unique path leads from the root node to any other node

• Direction of path indicates evolutionary time

• Root chosen as a sequence thought to have branched off earliest

Page 25: Phylogenetic Analysis

25

Rooted Trees

• If molecular clock hypothesis holds, it is possible to predict a root

• As the number of sequences increases, the number of possible rooted trees increases very rapidly

• In most cases, a bifurcating binary tree is the best model to simulate evolutionary events

Page 26: Phylogenetic Analysis

26

Example Rooted Tree

SYSTEMATICS AND MOLECULAR PHYLOGENETICS

Image source: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Page 27: Phylogenetic Analysis

27

Unrooted Tree (Star)

• Indicates evolutionary relationship without revealing the location of the oldest ancestry

• For the same set of sequences, there are fewer possible unrooted trees than rooted trees
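How quickly the number of trees grows can be made concrete with the standard counting formulas for strictly bifurcating trees on n labeled sequences: (2n-5)!! unrooted trees and (2n-3)!! rooted trees. The formulas are not given on the slides, and the function names below are only illustrative; this is a small sketch rather than part of the original material.

```python
def double_factorial(k):
    """Product k * (k-2) * (k-4) * ... down to 1 (taken as 1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n):
    """Unrooted bifurcating trees for n >= 3 labeled sequences: (2n-5)!!"""
    return double_factorial(2 * n - 5)


def count_rooted(n):
    """Rooted bifurcating trees for n >= 2 labeled sequences: (2n-3)!!"""
    return double_factorial(2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
```

For four sequences this gives 3 unrooted and 15 rooted trees, which agrees with the three unrooted trees used in the parsimony example.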

Page 28: Phylogenetic Analysis

28

Example Unrooted Tree

Image source: http://www.shef.ac.uk/english/language/quantling/images/quantling1.jpg

Page 29: Phylogenetic Analysis

29

• Image: http://www.ncbi.nlm.nih.gov/About/primer/phylo.html

Page 30: Phylogenetic Analysis

30

Methods for Determining Trees

• Three main methods:

– Maximum parsimony

– Distance

– Maximum likelihood

Page 31: Phylogenetic Analysis

31

Maximum Parsimony

• Predicts evolutionary tree minimizing number of steps required to generate observed variation

• Multiple sequence alignment must first be obtained

Page 32: Phylogenetic Analysis

32

Maximum Parsimony

• For each position, phylogenetic trees requiring the smallest number of evolutionary changes to produce the observed sequence changes are identified

• Trees that produce the smallest number of changes for all sequence positions are identified

Page 33: Phylogenetic Analysis

33

Maximum Parsimony

• Time consuming algorithm

• Only works well if the sequences have a strong sequence similarity

Page 34: Phylogenetic Analysis

34

Maximum Parsimony Example

1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

• four sequences, three possible unrooted trees

Page 35: Phylogenetic Analysis

35

Maximum Parsimony Example

Possible trees (unrooted), written as the pairs of neighboring sequences:

– (1,2) and (3,4)

– (1,3) and (2,4)

– (1,4) and (2,3)

Page 36: Phylogenetic Analysis

36

Maximum Parsimony Example

• Some sites are informative, and other sites are not

• An informative site has at least two different sequence characters, each present in at least two of the sequences

• Only the informative sites need to be considered
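As a rough sketch of how informative sites could be picked out, the snippet below scans the four example sequences column by column. It treats a column as informative when at least two different characters each occur in at least two sequences (the usual parsimony criterion); the function name and the dictionary layout are assumptions made for illustration.

```python
from collections import Counter

# The four example sequences from the Maximum Parsimony Example slide.
alignment = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}


def informative_columns(seqs):
    """Return 1-based positions of parsimony-informative columns."""
    positions = []
    for index, column in enumerate(zip(*seqs.values()), start=1):
        counts = Counter(column)
        # Count the characters that appear in two or more sequences.
        shared = [ch for ch, n in counts.items() if n >= 2]
        if len(shared) >= 2:
            positions.append(index)
    return positions


print(informative_columns(alignment))
```

For this alignment the sketch reports columns 5, 7, and 9.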

Page 37: Phylogenetic Analysis

37

Maximum Parsimony Example

1 A A G A G T G C A
2 A G C C G T G C G
3 A G A T A T C C A
4 A G A G A T C C G

Three informative columns (columns 5, 7, and 9)

Page 38: Phylogenetic Analysis

38

Maximum Parsimony Example

1 G G A
2 G G G
3 A C A
4 A C G

(Figure: the three possible unrooted trees drawn for each informative column, labeled Column 1, Column 2, and Column 3, with a mark on each branch where a substitution is required.)
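The count of substitutions per tree can be sketched with Fitch's set-based counting, which is one common way to score a tree under parsimony (the slides do not name a specific procedure, so this choice is an assumption). The input is the three informative columns for sequences 1 to 4, and each tree is written as its two pairs of neighbors; all names are illustrative.

```python
# Informative columns only, one string per sequence.
columns = {"1": "GGA", "2": "GGG", "3": "ACA", "4": "ACG"}

# The three unrooted trees for four sequences, as pairs of neighbors.
trees = [
    (("1", "2"), ("3", "4")),
    (("1", "3"), ("2", "4")),
    (("1", "4"), ("2", "3")),
]


def fitch_join(x, y):
    """Combine two state sets; an empty intersection costs one change."""
    common = x & y
    if common:
        return common, 0
    return x | y, 1


def tree_changes(seqs, tree):
    """Minimum number of substitutions the tree needs over all columns."""
    (a, b), (c, d) = tree
    total = 0
    for i in range(len(seqs[a])):
        left, n1 = fitch_join({seqs[a][i]}, {seqs[b][i]})
        right, n2 = fitch_join({seqs[c][i]}, {seqs[d][i]})
        _, n3 = fitch_join(left, right)
        total += n1 + n2 + n3
    return total


scores = {tree: tree_changes(columns, tree) for tree in trees}
for tree, score in scores.items():
    print(tree, score)
```

With these columns the tree that pairs sequence 1 with 2 and sequence 3 with 4 needs 4 changes, against 5 and 6 for the other two, so it is the most parsimonious of the three.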

Page 39: Phylogenetic Analysis

39

Distance Method

• Looks at the number of changes between each pair in a group of sequences

• Goal is to identify a tree that positions neighbors correctly and that also has branch lengths which reproduce the original data as closely as possible

Page 40: Phylogenetic Analysis

40

Distance Method

• CLUSTALW uses the neighbor-joining method to build a guide tree for multiple sequence alignment

• The PHYLIP suite of programs includes neighbor-joining methods

– http://evolution.genetics.washington.edu/phylip.html

Page 41: Phylogenetic Analysis

41

Distance Programs in Phylip

• NEIGHBOR: estimates phylogenies using either:

– neighbor-joining (no molecular clock assumed)

– unweighted pair group method with arithmetic mean (UPGMA) (molecular clock assumed)

Page 42: Phylogenetic Analysis

42

Distance Analysis

• Distance score counted as

– number of mismatched positions in the alignment

– number of sequence positions that must be changed to generate the second sequence

• Success depends on the degree to which the distances among a set of sequences can be made additive on a predicted evolutionary tree

Page 43: Phylogenetic Analysis

43

Example of Distance Analysis

• Consider the alignment:

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

Page 44: Phylogenetic Analysis

44

Example of Distance Analysis

• Distances can be shown as a table

A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT
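A distance table of this kind can be sketched by counting mismatched positions for every pair of aligned sequences. The helper name and the dictionary-of-pairs layout below are assumptions for illustration, not something taken from the slides.

```python
from itertools import combinations

# The four aligned sequences from the example.
seqs = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def mismatches(s, t):
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(s, t))


# One entry per unordered pair, e.g. table[("A", "B")].
table = {
    (a, b): mismatches(seqs[a], seqs[b]) for a, b in combinations(seqs, 2)
}

for pair, distance in table.items():
    print(pair, distance)
```

For these sequences the counts come out as A-B 3, A-C 7, A-D 8, B-C 6, B-D 7, and C-D 3.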

Page 45: Phylogenetic Analysis

45

Example of Distance Analysis

• Using this information, a tree can be drawn:

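A ACGCGTTGGGCGATGGCAAC
B ACGCGTTGGGCGACGGTAAT
C ACGCATTGAATGATGATAAT
D ACACATTGAGTGATAATAAT

(Figure: unrooted tree in which A and B branch from one internal node with lengths 2 and 1, C and D branch from the other internal node with lengths 1 and 2, and the internal branch joining the two nodes has length 4.)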
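Whether a tree reproduces the original distances can be checked by adding up branch lengths along the path between each pair of sequences. The sketch below assumes the branch lengths are assigned as A = 2, B = 1, C = 1, D = 2 with an internal branch of 4, which is the only assignment of the figure's values that matches the pairwise mismatch counts for this alignment; the encoding of the tree is hypothetical.

```python
# Pairwise mismatch counts for the example alignment.
observed = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
}

# Hypothetical encoding: leaf -> (internal node it attaches to, branch length).
leaf_branch = {
    "A": ("left", 2),
    "B": ("left", 1),
    "C": ("right", 1),
    "D": ("right", 2),
}
internal_branch = 4  # branch joining the two internal nodes


def tree_distance(x, y):
    """Path length between two leaves on the four-leaf tree."""
    node_x, len_x = leaf_branch[x]
    node_y, len_y = leaf_branch[y]
    total = len_x + len_y
    if node_x != node_y:
        total += internal_branch  # the path crosses the internal branch
    return total


predicted = {pair: tree_distance(*pair) for pair in observed}
print(predicted == observed)
```

Here every predicted distance equals the observed one, so the distances are exactly additive on this tree; real data rarely fit this cleanly.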

Page 46: Phylogenetic Analysis

46

Fitch and Margoliash Algorithm (3 sequences)

• Distance table used

• Sequences combined in threes

– define the branches of the predicted tree

– calculate the branch lengths of the tree

Page 47: Phylogenetic Analysis

47

Fitch and Margoliash Algorithm (3 sequences)

• 1) Draw unrooted tree with three branches originating from common node:

(Figure: unrooted tree in which branches a, b, and c lead from a common node to sequences A, B, and C.)

Page 48: Phylogenetic Analysis

48

Fitch and Margoliash Algorithm (3 sequences)

• 2) Calculate lengths of tree branches algebraically:

• distance from A to B = a + b = 22 (1)

• distance from A to C = a + c = 39 (2)

• distance from B to C = b + c = 41 (3)

• subtracting (2) from (3) yields b – a = 2 (4)

• adding (1) and (4) yields 2b = 24; b = 12

• so a + 12 = 22; a = 10

• 10 + c = 39; c = 29
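The same algebra can be written in closed form: solving the three equations gives each branch as half of (the two distances that involve that sequence, minus the third distance). A minimal sketch, with a hypothetical function name:

```python
def three_point(d_ab, d_ac, d_bc):
    """Branch lengths a, b, c of a three-leaf unrooted tree.

    Solves a + b = d_ab, a + c = d_ac, b + c = d_bc.
    """
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


a, b, c = three_point(22, 39, 41)
print(a, b, c)
```

For the distances 22, 39, and 41 this returns a = 10, b = 12, and c = 29, matching the step-by-step result.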

Page 49: Phylogenetic Analysis

49

Fitch and Margoliash Algorithm (3 sequences)

• 3) Resulting tree:

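(Figure: the same unrooted tree with branch lengths a = 10 leading to A, b = 12 leading to B, and c = 29 leading to C.)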

Page 50: Phylogenetic Analysis

50

Fitch and Margoliash Algorithm (5 sequences)

• Algorithm can be extended to more sequences. Consider the distances:

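(Figure: distances among sequences A, B, C, D, and E, and an unrooted tree with those five leaves whose branches are labeled a to g.)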

Page 51: Phylogenetic Analysis

51

Summary of Fitch-Margoliash

1) Find the most closely related pair of sequences (A, B).

2) Treat the rest of the sequences as a composite. Calculate the average distance from A to all others; and from B to all others.

3) Use these values to calculate the length of the edges a and b.

Page 52: Phylogenetic Analysis

52

Summary of Fitch-Margoliash

4) Treat A and B as a composite. Calculate the average distances between AB and each of the other sequences. Create a new distance table.

5) Identify next pair of related sequences and begin as with step 1.

6) Subtract extended branch lengths to calculate lengths of intermediate branches.

Page 53: Phylogenetic Analysis

53

Summary of Fitch-Margoliash

7) Repeat the entire process with all possible pairs of sequences.

8) Calculate predicted distances between each pair of sequences for each tree to find the best tree.
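Steps 1 to 4 of the summary can be sketched as a single pass over a distance table. The example reuses the mismatch counts from the four-sequence alignment (A-B 3, A-C 7, A-D 8, B-C 6, B-D 7, C-D 3) because the five-sequence table is not available in this transcript; the function names and the tie-breaking rule (first minimum wins) are assumptions. Steps 5 to 8 are not covered.

```python
from itertools import combinations

dist = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
}


def d(table, x, y):
    """Look up a distance regardless of the order of the pair."""
    return table[(x, y)] if (x, y) in table else table[(y, x)]


def fm_first_pass(table, taxa):
    # Step 1: most closely related pair (ties go to the first pair found).
    x, y = min(combinations(taxa, 2), key=lambda p: d(table, *p))
    rest = [t for t in taxa if t not in (x, y)]
    # Step 2: average distance from each member of the pair to the rest.
    avg_x = sum(d(table, x, t) for t in rest) / len(rest)
    avg_y = sum(d(table, y, t) for t in rest) / len(rest)
    # Step 3: three-point formulas with the rest treated as one composite.
    d_xy = d(table, x, y)
    branch_x = (d_xy + avg_x - avg_y) / 2
    branch_y = (d_xy + avg_y - avg_x) / 2
    # Step 4: new table with x and y merged into one composite sequence.
    merged = x + y
    new_table = {
        (merged, t): (d(table, x, t) + d(table, y, t)) / 2 for t in rest
    }
    for p in combinations(rest, 2):
        new_table[p] = d(table, *p)
    return (x, y), branch_x, branch_y, new_table


pair, a, b, new_table = fm_first_pass(dist, ["A", "B", "C", "D"])
print(pair, a, b, new_table)
```

On this table the first pass joins A and B with branch lengths 2 and 1, and the reduced table holds AB-C 6.5, AB-D 7.5, and C-D 3.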

Page 54: Phylogenetic Analysis

54

Neighbor Joining

• Similar to Fitch-Margoliash

• Sequences chosen to give best least-squares estimate of branch length
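A first neighbor-joining step might look like the sketch below. It uses the standard criterion Q(i,j) = (n-2)·d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other sequences, and joins the pair with the smallest Q; that criterion is the usual textbook formulation rather than something spelled out on the slides. The distances are the mismatch counts from the four-sequence example, and the names are illustrative.

```python
from itertools import combinations

dist = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
    ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3,
}
taxa = ["A", "B", "C", "D"]


def d(x, y):
    """Look up a distance regardless of the order of the pair."""
    return dist[(x, y)] if (x, y) in dist else dist[(y, x)]


def nj_first_join(names):
    n = len(names)
    # r[t]: total distance from t to every other sequence.
    r = {t: sum(d(t, u) for u in names if u != t) for t in names}
    # Q criterion for every pair; the smallest value is joined first.
    q = {
        (x, y): (n - 2) * d(x, y) - r[x] - r[y]
        for x, y in combinations(names, 2)
    }
    x, y = min(q, key=q.get)
    # Branch lengths from the new internal node to x and to y.
    branch_x = d(x, y) / 2 + (r[x] - r[y]) / (2 * (n - 2))
    branch_y = d(x, y) - branch_x
    return (x, y), branch_x, branch_y, q


pair, bx, by, q = nj_first_join(taxa)
print(pair, bx, by)
```

Here A-B and C-D tie for the smallest Q (both -28); the sketch joins A and B first and assigns them branch lengths 2 and 1.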

Page 55: Phylogenetic Analysis

55

Maximum Likelihood

• Calculates the likelihood of each candidate tree given an alignment

• Trees requiring the fewest changes tend to be the most likely

Page 56: Phylogenetic Analysis

56

Maximum Likelihood (ML)

• Probability of each tree is product of mutation rates in each branch

• Likelihoods given by each column multiplied to give the likelihood of the tree

Page 57: Phylogenetic Analysis

57

Maximum Likelihood (ML)

• Disadvantages:

– Computationally intensive

– Can only be done for a handful of sequences

Page 58: Phylogenetic Analysis

58

Which Method to Choose?

• Depends upon the sequences that are being compared

– Strong sequence similarity: maximum parsimony

– Clearly recognizable sequence similarity: distance methods

– All others: maximum likelihood

Page 59: Phylogenetic Analysis

59

Distance, Parsimony and ML

• Distance matrix: simply count the number of differences between two sequences.

• Maximum Parsimony: search for a tree that requires the smallest number of changes to explain the differences observed among the taxa.

• ML: evaluates the probability that the chosen evolutionary model has generated the observed data. A simple model is that changes between all nucleotides (or amino acids) are equally probable. The probabilities of all possible reconstructions are summed to yield the likelihood for one particular site. The likelihood for the tree is the product of the likelihoods for all alignment positions in the dataset.
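The ML description can be illustrated on the smallest possible tree: two sequences joined through one unknown ancestor. The sketch assumes the Jukes-Cantor model (all changes equally probable, equal base frequencies of 1/4), sums over the four possible ancestral bases at each site, and adds log-likelihoods across sites, which is equivalent to multiplying the site likelihoods. The sequences are A and B from the distance example; function and variable names are illustrative.

```python
import math

BASES = "ACGT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x is replaced by y over a branch
    of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(x, y, t1, t2):
    """Sum over all possible ancestral bases z for one alignment column."""
    return sum(0.25 * jc_prob(z, x, t1) * jc_prob(z, y, t2) for z in BASES)


def log_likelihood(s1, s2, t1, t2):
    """Log of the product of site likelihoods over the whole alignment."""
    return sum(
        math.log(site_likelihood(x, y, t1, t2)) for x, y in zip(s1, s2)
    )


seq_a = "ACGCGTTGGGCGATGGCAAC"
seq_b = "ACGCGTTGGGCGACGGTAAT"

# Jukes-Cantor distance for 3 mismatches in 20 sites (p = 0.15).
t_hat = -0.75 * math.log(1 - 4 * 0.15 / 3)

for total in (0.05, t_hat, 0.5):
    print(round(total, 4), log_likelihood(seq_a, seq_b, total / 2, total / 2))
```

Of the three branch-length totals tried, the log-likelihood is highest at the Jukes-Cantor distance, which is the maximum-likelihood estimate for a pair of sequences under this model. A full ML tree search repeats this kind of calculation over many topologies and branch lengths, which is where the computational cost comes from.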

Page 60: Phylogenetic Analysis

60

Which Method to Choose?

• Best to choose at least two approaches

• Compare the results – if they are similar, you can have more confidence

Page 61: Phylogenetic Analysis

61

Difficulties With Phylogenetic Analysis

• Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine phylogenetic origin of some evolutionary events.

• Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically.

Page 62: Phylogenetic Analysis

62

Difficulties With Phylogenetic Analysis

• Two sites within comparative sequences may be evolving at different rates.

• Re-arrangements of genetic material can lead to false conclusions.

• Duplicated genes can evolve along separate pathways, leading to different functions

Page 64: Phylogenetic Analysis

64

Exercise

• Multiple Sequence Alignment

– Sequence Alignment: CLUSTALW

– Sample sequences: found on E-learning system

Page 65: Phylogenetic Analysis

65

Explanation on the parameters

Page 66: Phylogenetic Analysis

66

Exercise