Phylogenetic trees as a visualization tool for evolutionary classification

Date posted: 19-Dec-2015


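Trees

[Figure: the same tree over Human, Chimp, and Gorilla drawn four times with the leaves arranged differently, joined by equals signs.]

Same thing…

[Figure: a tree over s1 to s5 drawn in two different ways, joined by an equals sign.]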
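The idea behind these drawings, that rotating the branches around an internal node does not change the tree, can be checked mechanically. The following is a minimal sketch, assuming trees are written as nested Python tuples; the topology ((Human, Chimp), Gorilla) is used only for illustration and `canonical` is a hypothetical helper name.

```python
def canonical(tree):
    """Return an order-independent form of a rooted tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    Children are sorted by their text representation, so two drawings
    that differ only by rotated branches give the same result.
    """
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))


# Two drawings of the same (assumed) topology with rotated branches.
drawing_1 = (("Human", "Chimp"), "Gorilla")
drawing_2 = ("Gorilla", ("Chimp", "Human"))
# A genuinely different topology.
other = (("Human", "Gorilla"), "Chimp")

same = canonical(drawing_1) == canonical(drawing_2)
different = canonical(drawing_1) != canonical(other)
print(same, different)
```

Comparing canonical forms rather than the raw tuples is what makes the left-to-right order of the leaves irrelevant.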

Bifurcating / Multifurcating

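[Figure: a tree over s1 to s5 with a multifurcating internal node.]

A multifurcation = polytomy

[Figure: a fully bifurcating tree over s1 to s5.]

A bifurcation = dichotomy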

There are two types of polytomies: soft (lack of information to resolve the tree) and hard (multiple divergence in short evolutionary time).

A “comb”

[Figure: a comb drawn over s1 to s5.]

Terminology

A branch = an edge

External node = leaf

[Figure: rooted tree over Human, Chimp, Chicken, and Gorilla, with the root and the internal nodes marked.]

Ingroup / Outgroup:

[Figure: the tree over Human, Chimp, Chicken, and Gorilla, with the labels INGROUP and OUTGROUP.]

Subtrees

[Figure: tree over Human, Chimp, Chicken, Gorilla, and Duck with one subtree highlighted.]

Monophyletic groups

[Figure: tree over Human, Chimp, Chicken, and Gorilla.]

Gorilla+Human+Chimp are monophyletic. A clade is a monophyletic group.

Paraphyletic = Non-monophyletic groups

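[Figure: tree over Whale, Chimp, Drosophila, and Zebrafish.]

Zebrafish+Whale are paraphyletic: their most recent common ancestor is also an ancestor of Chimp, which the group leaves out.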
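A group is monophyletic exactly when it equals the full leaf set of some subtree. The sketch below tests that, assuming nested-tuple trees and two illustrative topologies that are not taken from the figures; `leaf_set`, `clades`, and `is_monophyletic` are hypothetical names. Following the slide's usage, it only separates monophyletic from non-monophyletic groups.

```python
def leaf_set(tree):
    """Leaves below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def clades(tree):
    """Leaf sets of every subtree (one per node) of the tree."""
    found = {leaf_set(tree)}
    if not isinstance(tree, str):
        for child in tree:
            found |= clades(child)
    return found


def is_monophyletic(tree, group):
    """True if the group is exactly the leaf set of some subtree."""
    return frozenset(group) in clades(tree)


# ASSUMED topologies, for illustration only.
primates = ("Chicken", ("Gorilla", ("Human", "Chimp")))
animals = ("Drosophila", ("Zebrafish", ("Whale", "Chimp")))

apes_ok = is_monophyletic(primates, {"Gorilla", "Human", "Chimp"})
fish_whale_ok = is_monophyletic(animals, {"Zebrafish", "Whale"})
print(apes_ok, fish_whale_ok)
```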

3. Tree building

The maximum parsimony principle.

Genes: 0 = absence, 1 = presence

species  g1  g2  g3  g4  g5  g6
s1       1   0   0   1   1   0
s2       0   0   1   0   0   0
s3       1   1   0   0   0   0
s4       1   1   0   1   1   1
s5       0   0   1   1   1   0

Evaluate this tree…

[Figure: candidate tree with the leaves in the order s1, s4, s3, s2, s5.]

Gene number 1. Leaf states (s1, s4, s3, s2, s5): 1 1 1 0 0.

Option number 1. Internal-node states as drawn: 1 0 1 1.

Option number 2. Internal-node states as drawn: 1 0 0 1.

Number of changes for gene 1 (character 1) = 1

Gene number 2. Leaf states (s1, s4, s3, s2, s5): 0 1 1 0 0.

Option number 1. Internal-node states as drawn: 1 0 0 1.

Option number 2. Internal-node states as drawn: 1 0 1 1.

Option number 3. Internal-node states as drawn: 0 0 0 0.

Number of changes for gene 2 (character 2) = 2

Gene number 3. Leaf states (s1, s4, s3, s2, s5): 0 0 0 1 1.

Option number 1. Internal-node states as drawn: 0 1 0 0.

Option number 2. Internal-node states as drawn: 0 1 1 0.

Number of changes for gene 3 (character 3) = 1

Gene number 4. Leaf states (s1, s4, s3, s2, s5): 1 1 0 0 1.

Option number 1. Internal-node states as drawn: 1 1 1 1.

Option number 2. Internal-node states as drawn: 0 0 0 1.

Number of changes for gene 4 (character 4) = 2

Gene number 5 is the same as Gene number 4

Number of changes for gene 5 (character 5) = 2

Gene number 6, 1 option only. Leaf states (s1, s4, s3, s2, s5): 0 1 0 0 0. Internal-node states as drawn: 0 0 0 0.

Number of changes for gene 6 (character 6) = 1

Sum of changes

Number of changes for gene 1 (character 1) = 1
Number of changes for gene 2 (character 2) = 2
Number of changes for gene 3 (character 3) = 1
Number of changes for gene 4 (character 4) = 2
Number of changes for gene 5 (character 5) = 2
Number of changes for gene 6 (character 6) = 1

Sum of changes for this tree topology = 9

Can we do better?
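The per-gene counting above can be reproduced by brute force: try every assignment of 0/1 to the internal nodes (the "options") and keep the one with the fewest changes. This is only a sketch. The topology (((s1,s4),s3),(s2,s5)) is an assumption, since the tree drawing did not survive extraction; it was chosen because it reproduces the per-gene counts on the slides. The node names `n1`, `n2`, `n3`, and `root` are hypothetical.

```python
from itertools import product

# Presence/absence matrix from the slides: species -> states of g1..g6.
GENES = {
    "s1": [1, 0, 0, 1, 1, 0],
    "s2": [0, 0, 1, 0, 0, 0],
    "s3": [1, 1, 0, 0, 0, 0],
    "s4": [1, 1, 0, 1, 1, 1],
    "s5": [0, 0, 1, 1, 1, 0],
}

# ASSUMED topology (((s1,s4),s3),(s2,s5)), written as child -> parent edges.
PARENT = {
    "s1": "n1", "s4": "n1",
    "n1": "n2", "s3": "n2",
    "s2": "n3", "s5": "n3",
    "n2": "root", "n3": "root",
}
INTERNAL = ["n1", "n2", "n3", "root"]


def min_changes(gene_index):
    """Try every 0/1 assignment of the internal nodes and return the
    smallest number of edges whose two ends carry different states."""
    best = None
    for states in product([0, 1], repeat=len(INTERNAL)):
        value = dict(zip(INTERNAL, states))
        for species, row in GENES.items():
            value[species] = row[gene_index]
        changes = sum(value[c] != value[p] for c, p in PARENT.items())
        best = changes if best is None else min(best, changes)
    return best


per_gene = [min_changes(g) for g in range(6)]
print(per_gene, sum(per_gene))
```

Enumerating all assignments grows exponentially with the number of internal nodes, which is why a faster scoring method is worth having.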

The MP (most parsimonious) tree:

[Figure: the most parsimonious tree over s1, s4, s3, s2, s5.]

Sum of changes for this tree topology = 8

How to efficiently compute the MP score of a tree

The Fitch algorithm (1971):

[Figure: tree over Human, Chimp, Chicken, Gorilla, and Duck with one of the states A, G, C, C, A at each leaf; the internal nodes carry the sets {A,G}, {A,C,G}, {A,C}, {A,C}.]

Postorder tree scan. At each internal node, if the intersection of its children's sets is empty, we apply a union operator. Otherwise, we take the intersection.

Number of changes

Total number of changes = number of union operators.
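The postorder scan and the union count can be sketched in a few lines. The topology and the leaf names `L1` to `L5` are assumptions (the drawing is not recoverable); the topology was picked because it yields the same internal sets as the slide: {A,G}, {A,C,G}, {A,C}, {A,C}.

```python
def fitch(tree, states):
    """Postorder Fitch pass.

    tree   : a leaf name (str) or a pair (left, right) of subtrees
    states : dict mapping leaf name -> observed character
    Returns (state set at this node, number of union operations so far).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes                   # intersection: no change
    return left_set | right_set, changes + 1     # union: one change


# ASSUMED topology and hypothetical leaf names.
tree = (("L1", "L2"), ("L3", ("L4", "L5")))
states = {"L1": "C", "L2": "A", "L3": "C", "L4": "A", "L5": "G"}

root_set, total = fitch(tree, states)
print(sorted(root_set), total)
```

Each node is visited once, so the score of one character is obtained in time proportional to the number of nodes.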

Patterns:

CACAG requires the same number of changes as CACAT, and in general so does every position with the pattern XYXYZ.
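One way to detect identical patterns is to relabel each column by order of first appearance, so that CACAG and CACAT both become the same tuple. A minimal sketch; `site_pattern` is a hypothetical helper name.

```python
def site_pattern(column):
    """Relabel a column by order of first appearance.

    The first distinct character becomes 0, the second 1, and so on,
    so columns that differ only in which letters they use compare equal.
    """
    labels = {}
    return tuple(labels.setdefault(ch, len(labels)) for ch in column)


same_pattern = site_pattern("CACAG") == site_pattern("CACAT")
other_pattern = site_pattern("CACAG") == site_pattern("CCAAG")
print(site_pattern("CACAG"), same_pattern, other_pattern)
```

With unit costs, columns that share a pattern need to be scored only once and the result multiplied by how often the pattern occurs.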

Ex:

[Figure: the tree over Human, Chimp, Chicken, Gorilla, and Duck with the leaf sequences GACA, GGGA, CAAG, GCGA, GAAA.]

Find the minimum number of changes. Point out all identical patterns.

Ambiguous characters:

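[Figure: the tree over Human, Chimp, Chicken, Gorilla, and Duck with one of the states A, G, C, C, R at each leaf; the internal nodes carry the sets {A,G}, {A,C,G}, {A,C,G}, {A,C,G}.]

R = {A,G} = purine.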
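Ambiguous characters fit into the Fitch pass by letting a leaf start from the whole set its code stands for. The sketch below assumes the same hypothetical topology and leaf names as before and assumes that the ambiguous leaf is `L2`; with these choices the internal sets match the ones listed on the slide.

```python
# Ambiguity codes mapped to the bases they may stand for (a small subset).
CODES = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"},  # purine
    "Y": {"C", "T"},  # pyrimidine
}


def fitch_sets(tree, codes_at_leaves, collected):
    """Fitch pass in which each leaf starts from the set of its code.

    Appends every internal node's set to `collected` (postorder) and
    returns (set at this node, number of union operations so far).
    """
    if isinstance(tree, str):
        return set(CODES[codes_at_leaves[tree]]), 0
    a, a_changes = fitch_sets(tree[0], codes_at_leaves, collected)
    b, b_changes = fitch_sets(tree[1], codes_at_leaves, collected)
    common = a & b
    result = common if common else a | b
    collected.append(result)
    return result, a_changes + b_changes + (0 if common else 1)


# ASSUMED topology; which leaf carries R is also an assumption.
tree = (("L1", "L2"), ("L3", ("L4", "L5")))
codes = {"L1": "C", "L2": "R", "L3": "C", "L4": "A", "L5": "G"}

internal_sets = []
root, changes = fitch_sets(tree, codes, internal_sets)
print([sorted(s) for s in internal_sets], changes)
```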

Subtrees

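Each node has an ID.

[Figure: tree over Human, Chimp, Chicken, Gorilla, and Duck; the leaves are nodes 7, 8, 2, 5, 3 and the internal nodes are 6, 4, 1, 0. The subtree of node 4 is marked.]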

The Sankoff algorithm:

Generalization: it assumes a cost Cij for changing from state i to state j.

If Cij = 1 for every i ≠ j, it just counts the number of changes.

We now search for the tree with the minimum cost.

Definition: Si(k) = minimum cost of the subtree of node i, given that node i is assigned character k.

Easy to compute for the leaves.

For example, S2(A) = 0 (node 2 is a leaf carrying A, so A costs nothing there)

S2(C) = S2(G) = S2(T) = ∞ (they just can’t be there).

[Figure: the tree with leaf nodes 7, 8, 2, 5, 3 and internal nodes 6, 4, 1, 0; the leaves carry the states A, G, A, A, C.]

Each leaf gets a vector [S(A), S(C), S(G), S(T)]: [0, ∞, ∞, ∞] for a leaf carrying A, [∞, 0, ∞, ∞] for C, and [∞, ∞, 0, ∞] for G.

[Figure: node 0 with children 1 and 2, which carry the vectors [S1(A), S1(C), S1(G), S1(T)] and [S2(A), S2(C), S2(G), S2(T)].]

Costs:

     A  C  G  T
A    0  3  1  2
C    3  0  2  1
G    1  2  0  3
T    2  1  3  0

S0(A) = min_X (CAX + S1(X)) + min_Y (CAY + S2(Y))
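Written for any internal node and any state, the recurrence on this slide takes the following form. This is a restatement, assuming that ch(i) (a symbol introduced here) denotes the children of node i and that c_{kx} is the slide's cost of changing from k to x.

```latex
S_i(k) \;=\; \sum_{j \in \mathrm{ch}(i)} \; \min_{x \in \{A,C,G,T\}} \bigl[\, c_{kx} + S_j(x) \,\bigr],
\qquad
S_{\mathrm{leaf}}(k) \;=\;
\begin{cases}
0 & \text{if the leaf carries } k,\\
\infty & \text{otherwise.}
\end{cases}
```

For node 0 with children 1 and 2 and k = A, the sum has two terms and reduces to the expression for S0(A) given on the slide.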

Example with S1 = [13, 17, 22, 14] at node 1 and S2 = [15, 14, 21, 17] at node 2, using the same costs:

S0(A) = min { 13, 17 + 3, 22 + 1, 14 + 2 } + min { 15, 14 + 3, 21 + 1, 17 + 2 } = 13 + 15 = 28.

S0(C) = min { 13 + 3, 17, 22 + 2, 14 + 1 } + min { 15 + 3, 14, 21 + 2, 17 + 1 } = 15 + 14 = 29.

S0(G) = min { 13 + 1, 17 + 2, 22, 14 + 3 } + min { 15 + 1, 14 + 2, 21, 17 + 3 } = 14 + 16 = 30.

S0(T) = min { 13 + 2, 17 + 1, 22 + 3, 14 } + min { 15 + 2, 14 + 1, 21 + 3, 17 } = 14 + 15 = 29.

The vector at node 0 is [28, 29, 30, 29].

The cost of the tree is the minimum of this vector, which is 28.
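The arithmetic of this single step can be checked with a short sketch. The cost matrix and the two child vectors come from the slides; the names `COST` and `combine` are hypothetical.

```python
STATES = "ACGT"
# Cost matrix from the slides: COST[i][j] is the cost of changing i -> j.
COST = {
    "A": {"A": 0, "C": 3, "G": 1, "T": 2},
    "C": {"A": 3, "C": 0, "G": 2, "T": 1},
    "G": {"A": 1, "C": 2, "G": 0, "T": 3},
    "T": {"A": 2, "C": 1, "G": 3, "T": 0},
}


def combine(child_vectors):
    """One Sankoff step: build the parent's vector from its children's.

    Every vector is a list ordered like STATES. For each parent state k,
    each child contributes the cheapest (change cost + child subtree cost).
    """
    parent = []
    for k in STATES:
        total = 0
        for vec in child_vectors:
            total += min(COST[k][x] + vec[i] for i, x in enumerate(STATES))
        parent.append(total)
    return parent


s1 = [13, 17, 22, 14]
s2 = [15, 14, 21, 17]
s0 = combine([s1, s2])
print(s0, min(s0))  # [28, 29, 30, 29] 28, as on the slides
```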

Dynamic programming.

This is an example of dynamic programming: you first solve small subproblems (the subtrees) and then reuse their solutions to build the solution to a larger problem.
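The whole procedure, leaves first and the root last, can be sketched as a short recursion. The cost matrix is the one from the slides; the tree topology for the leaf states A, G, A, A, C is an assumption, because the drawing is not recoverable, so the resulting numbers illustrate the method and are not taken from the slides.

```python
import math

STATES = "ACGT"
COST = [  # rows and columns ordered A, C, G, T, as on the slides
    [0, 3, 1, 2],
    [3, 0, 2, 1],
    [1, 2, 0, 3],
    [2, 1, 3, 0],
]


def sankoff(tree):
    """Return the cost vector [S(A), S(C), S(G), S(T)] of a (sub)tree.

    A leaf is a one-letter string; an internal node is a tuple of subtrees.
    """
    if isinstance(tree, str):
        # Leaf: cost 0 for the observed state, infinity for the others.
        return [0 if s == tree else math.inf for s in STATES]
    # Solve the small problems (the children) first ...
    child_vectors = [sankoff(child) for child in tree]
    # ... then reuse their vectors to solve the larger one.
    return [
        sum(min(COST[k][x] + vec[x] for x in range(4)) for vec in child_vectors)
        for k in range(4)
    ]


# ASSUMED topology for the leaf states A, G, A, A, C.
tree = (("A", "G"), ("A", ("A", "C")))
root_vector = sankoff(tree)
print(root_vector, min(root_vector))
```

Each subtree's vector is computed once and then reused by its parent, which is the dynamic-programming aspect.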

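Exercise. Compute the minimal cost for this tree.

[Figure: the tree with leaf nodes 7, 8, 2, 5, 3 and internal nodes 6, 4, 1, 0; the leaves carry the states A, G, A, C, C.]

Costs:

     A    C    G    T
A    0    2.5  1    2.5
C    2.5  0    2.5  1
G    1    2.5  0    2.5
T    2.5  1    2.5  0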

Solution: the vector at the root should be [6,6,7,8], thus, the answer is 6.