
Benjamin Loyle 2004 Cse 397

Solving Phylogenetic Trees

Benjamin Loyle

March 16, 2004

Cse 397 : Intro to MBIO


Table of Contents

Problem & Term Definitions

A DCM*-NJ Solution

Performance Measurements

Possible Improvements


Phylogeny

[Figure: phylogeny of Orangutan, Gorilla, Chimpanzee, and Human. From the Tree of Life website, University of Arizona.]


DNA Sequence Evolution

[Figure: DNA sequences evolving along a tree, on a timeline from 3 million years ago to today. Sequences shown: AAGACTT, TGGACTT, AAGGCCT, AGGGCAT, TAGCCCT, AGCACTT, TAGCCCA, TAGACTT, AGCGCTT, AGCACAA.]


Problem Definition

The Tree of Life
• Connecting all living organisms
• All-encompassing
• Find evolution from simple beginnings

Even smaller relations are tough
• Impossible

Infer possible ancestral history.


So what?

Genome sequencing provides an entire map of a species, so why link them?
• We can understand evolution
• Viable drug testing and design
• Predict the function of genes
• Influenza evolution


Why is that a problem?

Over 8 million organisms

The optimization problems behind current solutions are NP-hard

Computing a few hundred species takes years

Error is a very large factor


What do we want?

Input
• A collection of nodes, such as taxa or protein strings, to compare in a tree

Output
• A topological link to compare those nodes to each other

When do we want it? FAST!


Preparing the input

Create a distance matrix
• Sum up all of the known distances into a matrix sized n x n
• n is the number of nodes or taxa
• Found with sequence comparison


Distance Matrix

Take 5 separate DNA strings

A : GATCCATGA
B : GATCTATGC
C : GTCCCATTT
D : AATCCGATC
E : TCTCGATAG

The distance between A and B is 2

The distance between A and C is 4

The distance depends on the criterion you choose; here it is the number of positions at which two strings differ.
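The two distances quoted above can be reproduced with a small sketch. It assumes the criterion is a plain count of mismatched positions (Hamming distance); the function names `mismatches` and `distance_matrix` are made up for illustration and are not from the slides.

```python
from itertools import combinations

# The five example strings from the slide.
seqs = {
    "A": "GATCCATGA",
    "B": "GATCTATGC",
    "C": "GTCCCATTT",
    "D": "AATCCGATC",
    "E": "TCTCGATAG",
}

def mismatches(s, t):
    """Count positions where two equal-length strings differ (Hamming distance)."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t))

def distance_matrix(seqs):
    """Build the symmetric n x n matrix as a dict of dicts keyed by taxon name."""
    d = {x: {y: 0 for y in seqs} for x in seqs}
    for x, y in combinations(seqs, 2):
        d[x][y] = d[y][x] = mismatches(seqs[x], seqs[y])
    return d

d = distance_matrix(seqs)
print(d["A"]["B"], d["A"]["C"])  # prints: 2 4
```

Other criteria (for example, distances corrected for multiple changes at one site) would give a different matrix from the same strings.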


Distance Matrix

Let's start with an example matrix

      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0


Let's make it simple (constrain the input)

Let's keep the distance between nodes within a certain limit
• From F -> G
• F and G have the largest distance; they are the most dissimilar of any nodes.
• This is called the diameter of the tree

Let's keep the length of the input (length of the strings) polynomial.


ERROR?!?!!?

All trees are inferred, so how do you ever know if you're right?

How accurate do we have to be?

We can create data sets to test the trees that we create, and assume that the method will then work in the real world


Data Sets

JC (Jukes-Cantor) Model
• Sites evolve independently
• Sites change with the same probability
• Changes are single-character changes, i.e. A -> G or T -> C
• The number of changes on an edge e is a Poisson random variable with expectation $\lambda(e)$


More Data Sets

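K2P (Kimura 2-parameter) Model
• Based on the JC Model
• Allows a different probability for transitions than for transversions
• It's more likely for A and G to switch and for C and T to switch (transitions) than for any other pair to switch (transversions)
• Transitions are normally set to twice as likely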


Data Use

Using these models we can create our own evolution of data.

Start with one “ancestor” and create evolutions

Plug the evolved sequences back into the method and see if you get the tree you started with
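A minimal sketch of creating such test data, under a simplifying assumption: instead of drawing a Poisson number of changes per edge, each site simply changes with a fixed probability `p` per edge to one of the other three bases. The names `evolve`, `p`, and the seed are illustrative choices, not part of the slides.

```python
import random

BASES = "ACGT"

def evolve(seq, p, rng):
    """Copy seq; each site independently changes with probability p
    to one of the other three bases, chosen uniformly."""
    out = []
    for base in seq:
        if rng.random() < p:
            base = rng.choice([b for b in BASES if b != base])
        out.append(base)
    return "".join(out)

rng = random.Random(397)  # fixed seed so the run is repeatable
ancestor = "".join(rng.choice(BASES) for _ in range(20))

# Known "true" tree: ancestor -> (child_1, child_2), each child -> two leaves.
child_1 = evolve(ancestor, 0.1, rng)
child_2 = evolve(ancestor, 0.1, rng)
leaves = [
    evolve(child_1, 0.1, rng),
    evolve(child_1, 0.1, rng),
    evolve(child_2, 0.1, rng),
    evolve(child_2, 0.1, rng),
]
```

Because the tree that generated `leaves` is known, a tree inferred from `leaves` can be checked against it.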


Aspects of Trees

Topology
• The way in which nodes are connected to each other
• “Are we really connected to apes directly, or just linked long before we could be considered mammals?”

Distance
• The sum of the weighted edges to reach one node from another
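To make the distance definition concrete, here is a short sketch that sums edge weights along the path between two nodes. The tree, its edge weights, and the name `tree_distance` are hypothetical.

```python
def tree_distance(tree, start, goal):
    """Sum the edge weights along the unique path from start to goal.

    tree maps each node to a dict {neighbour: edge weight}.
    """
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, weight in tree[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + weight))
    raise ValueError("goal is not connected to start")

# Hypothetical tree: leaves A, B, C, D and internal nodes u, v.
tree = {
    "A": {"u": 2},
    "B": {"u": 3},
    "C": {"v": 1},
    "D": {"v": 5},
    "u": {"A": 2, "B": 3, "v": 4},
    "v": {"C": 1, "D": 5, "u": 4},
}

print(tree_distance(tree, "A", "C"))  # 2 + 4 + 1 = 7
```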


What can distance tell us?

The distance between nodes IS the evolutionary distance between the nodes

The distance between an ancestor and a leaf (present-day object) can be interpreted as an estimate of the number of evolutionary ‘steps’ that occurred.


Current Techniques

Maximum Parsimony
• Minimize the total number of evolutionary events
• Find the tree that has the minimum number of changes from ancestors

Maximum Likelihood
• Probability based
• Which tree is most probable given the current data
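For Maximum Parsimony, counting the changes a given tree needs can be sketched with Fitch's algorithm, a standard way to score one fixed rooted binary tree. This is only the scoring step, not the search over trees; the taxa `w`, `x`, `y`, `z`, their sequences, and the function names are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, changes needed) for one site.

    tree is a leaf name or a (left, right) tuple; states maps leaf -> character.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Total number of changes over all sites of the aligned sequences."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch(tree, {name: s[i] for name, s in seqs.items()})[1]
        for i in range(length)
    )

# Hypothetical aligned sequences for four taxa.
seqs = {"w": "AAG", "x": "AAG", "y": "CAT", "z": "CAT"}
good = (("w", "x"), ("y", "z"))
bad = (("w", "y"), ("x", "z"))
print(parsimony_score(good, seqs), parsimony_score(bad, seqs))  # prints: 2 4
```

The tree with the lower score explains the data with fewer evolutionary events, so Maximum Parsimony would prefer `good` here.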


More Techniques

Neighbor Joining
• Repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization
• It shrinks the distance matrix by considering two ‘neighbors’ as one node


Learning Neighbor Joining

Why will become apparent later on, but let's learn how to do Neighbor Joining (NJ)

      A    B    C    D    E
A     0    3    3    4    3
B          0    3    3    4
C               0    3    3
D                    0    3
E                         0


NJ Part 1

First start with a “star tree”

[Figure: star tree with the leaves A, B, C, D, and E all attached to one central node.]


NJ Part 2

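Combine the two closest nodes (from the distance matrix)
• In our case it is nodes A and B at distance 3
• Full NJ first corrects each distance by how far the two nodes are from all the others; A and B are still a best pair under that rule

[Figure: the star tree with A and B joined at a new internal node; C, D, and E remain on the center.]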


NJ Part 3

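Repeat this until the tree has n-2 internal nodes (3)
• n-2 internal nodes make it a binary tree. The center of the star and the A-B join give two, so we only have to include one more node.

[Figure: the finished binary tree on the leaves A, B, C, D, and E.]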
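The joining loop can be sketched as follows. The slides pick the pair by raw distance; this sketch instead uses the standard Saitou-Nei selection criterion, which corrects each distance by the nodes' total distance to all others. The 5-taxon matrix is a made-up example (not the slide's matrix), and `neighbor_joining` is an illustrative name.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Return the leaf sets merged at each join, in order.

    names: leaf labels; matrix: symmetric list of lists of distances.
    Joining stops when three nodes remain, which fixes an unrooted binary tree.
    """
    # Represent every active node by the frozenset of leaves beneath it.
    nodes = [frozenset([name]) for name in names]
    dist = {}
    for (i, x), (j, y) in combinations(enumerate(nodes), 2):
        dist[frozenset([x, y])] = matrix[i][j]

    joins = []
    while len(nodes) > 3:
        n = len(nodes)
        total = {x: sum(dist[frozenset([x, y])] for y in nodes if y != x)
                 for x in nodes}

        def q(pair):
            # Saitou-Nei criterion: the pair with the smallest value is joined.
            x, y = pair
            return (n - 2) * dist[frozenset(pair)] - total[x] - total[y]

        x, y = min(combinations(nodes, 2), key=q)
        merged = x | y
        # Shrink the matrix: the two neighbours become one node.
        for z in nodes:
            if z != x and z != y:
                dist[frozenset([merged, z])] = 0.5 * (
                    dist[frozenset([x, z])] + dist[frozenset([y, z])]
                    - dist[frozenset([x, y])])
        nodes = [z for z in nodes if z != x and z != y] + [merged]
        joins.append(merged)
    return joins

# Hypothetical distances for five taxa a..e.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joins = neighbor_joining(names, matrix)
```

For this matrix the first join is `a` with `b`, and the finished tree separates `{a, b}` and `{d, e}` from the rest. Branch lengths are omitted to keep the sketch short.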


Are we done?

ML and MP, even in heuristic form, take too long for large data sets

NJ has poor topological accuracy, especially for large diameter trees

We need something that works for large diameter trees and can be run fast.


Here’s what we want

Our Goal: an “Absolute Fast Converging” (afc) Method

A method $\Phi$ is afc for the model $M$ if, for all positive $f$, $g$, $\epsilon$, there is a polynomial $p$ such that, for all $(T, \{\lambda(e)\}) \in M_{f,g}$, on a set $S$ of $n$ sequences of length at least $p(n)$ generated on $T$, we have $\Pr[\Phi(S) = T] > 1 - \epsilon$.

• Simply: sequences of polynomial length should be enough to recover the true tree within a degree of error.


A DCM* - NJ Solution

Two-phase construction of a final phylogenetic tree given a distance matrix d.

Phase 1 : Create a set of plausible trees for the distance matrix

Phase 2 : Find the best fitting tree


Phase 1

For each $q \in \{d_{ij}\}$, compute a tree $t_q$

Let $T = \{ t_q : q \in \{d_{ij}\} \}$


Finding tq

Step 1: Compute Thresh(d,q)

Step 2: Triangulate Thresh(d,q)

Step 3: Compute an NJ tree for every maximal clique

Step 4: Merge the subtrees into a supertree


What does that mean

Breaking the problem up
• Create a threshold of diameters to break the problem into a bunch of smaller-diameter trees (cliques)
• Apply NJ to those cliques
• Merge them back


Finding tq (terms)

Threshold Graph
• Thresh(d,q) is the threshold graph where (i,j) is an edge if and only if $d_{ij} \le q$.


Threshold

Let's bring back our distance matrix and create a threshold graph with q equal to d15, the distance between A and E. So q = 67.


Distance Matrix

Our old example matrix

      A    B    C    D    E
A     0   63   94  111   67
B          0   79   96   16
C               0   47   83
D                    0  100
E                         0


With q = d15 = 67

[Figure: threshold graph on the nodes A, B, C, D, E with the edges A-B (63), A-E (67), B-E (16), and C-D (47).]
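The threshold graph for the example matrix can be computed directly from the definition. This is a small sketch; the dictionary layout and the name `threshold_graph` are illustrative choices.

```python
# Upper triangle of the example matrix, keyed by unordered pairs of taxa.
d = {
    frozenset("AB"): 63, frozenset("AC"): 94, frozenset("AD"): 111,
    frozenset("AE"): 67, frozenset("BC"): 79, frozenset("BD"): 96,
    frozenset("BE"): 16, frozenset("CD"): 47, frozenset("CE"): 83,
    frozenset("DE"): 100,
}

def threshold_graph(d, q):
    """Edges (i, j) of Thresh(d, q): exactly the pairs with d_ij <= q."""
    return {pair for pair, dist in d.items() if dist <= q}

edges = threshold_graph(d, 67)
print(sorted("".join(sorted(e)) for e in edges))  # prints: ['AB', 'AE', 'BE', 'CD']
```

Phase 1 repeats this for every value q that occurs in the matrix, so a larger q gives a denser graph.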


Triangulating

A graph is triangulated if any cycle with four or more vertices has a chord
• That is, an edge joining two nonconsecutive vertices of the cycle.

Our example is already triangulated, but let's look at another


Triangulating

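[Figure: a four-cycle on W, X, Y, Z whose four sides each have length 5; the two diagonals have lengths 10 and 15.]

Let's say this is for q = 5

The edges of length 10 and 15 would not be in the graph

To triangulate this graph you add the edge of length 10.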
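A sketch of the triangulation step on the four-cycle example. It assumes one particular greedy rule (add the lightest missing edge until the graph is triangulated), which may add more edges than strictly needed on larger graphs. Which diagonal has length 10 is not recoverable from the slide, so W-Z = 10 and X-Y = 15 are assumptions, as are the function names.

```python
from itertools import combinations

def is_triangulated(vertices, edges):
    """Chordality test: keep deleting a vertex whose remaining neighbours form a clique."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(vertices)
    while remaining:
        for v in remaining:
            nbrs = adj[v] & remaining
            if all(y in adj[x] for x, y in combinations(nbrs, 2)):
                break
        else:
            return False  # no such vertex, so some long cycle has no chord
        remaining.remove(v)
    return True

def greedy_triangulate(vertices, edges, weight):
    """Add missing edges, lightest first, until the graph is triangulated."""
    edges = set(edges)
    missing = sorted(
        (frozenset(p) for p in combinations(vertices, 2)
         if frozenset(p) not in edges),
        key=lambda e: weight[e],
    )
    for e in missing:
        if is_triangulated(vertices, edges):
            break
        edges.add(e)
    return edges

vertices = ["W", "X", "Y", "Z"]
weight = {
    frozenset("WX"): 5, frozenset("XZ"): 5, frozenset("ZY"): 5, frozenset("YW"): 5,
    frozenset("WZ"): 10,  # assumed diagonal of length 10
    frozenset("XY"): 15,  # assumed diagonal of length 15
}
cycle = {e for e, w in weight.items() if w <= 5}  # Thresh(d, 5)
filled = greedy_triangulate(vertices, cycle, weight)
```

Here `filled` is the cycle plus the single edge of length 10, which matches the slide.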


Maximal Cliques

A clique that cannot be enlarged by the addition of another vertex.

Recall our original threshold graph which is triangulated:


Triangulated Threshold Graph

Our old graph

[Figure: threshold graph on the nodes A, B, C, D, E with the edges A-B (63), A-E (67), B-E (16), and C-D (47).]


Clique

Our maximal cliques would be:

{A, B, E}

{C, D}
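These two cliques can be found mechanically. The sketch below uses the basic Bron-Kerbosch enumeration, which works on any graph; it is not the specialised polynomial-time method available for triangulated graphs. The name `maximal_cliques` is illustrative.

```python
def maximal_cliques(adj):
    """List all maximal cliques of a graph given as {vertex: set of neighbours}."""
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(clique)  # nothing can be added: the clique is maximal
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(frozenset(), frozenset(adj), frozenset())
    return cliques

# Threshold graph for q = 67 from the example matrix.
adj = {
    "A": {"B", "E"},
    "B": {"A", "E"},
    "C": {"D"},
    "D": {"C"},
    "E": {"A", "B"},
}
cliques = maximal_cliques(adj)
print(sorted("".join(sorted(c)) for c in cliques))  # prints: ['ABE', 'CD']
```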


Create Trees for the Cliques

We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}

How do we make these trees? Remember NJ?


Tree {A, B, E} and {C, D}

[Figure: one tree on the leaves A, B, and E, and one tree on the leaves C and D.]


Merge your separate trees together.

Create one Supertree
• This is done by creating a minimum set of edges in the trees and calling that the “backbone”

This is its own doctoral thesis, so let's do a little hand waving


That sounds NP-hard!

Computing the threshold graph is polynomial

Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance.

Finding the maximal cliques is only polynomial if the input graph is triangulated (which it is!).

If all the previous steps are done, creating a supertree can be done in polynomial time as well.


Where are we now?

We now have a finalized phylogeny, created from smaller trees in our matrix joined together

Remember we started from all possible sizes of smaller trees, so there is one such phylogeny for each threshold q.


Phase 2

Which one is right?

Found using the SQS (Short Quartet Support) method
• Let T be a tree in the set made in Phase 1
• Break the data into sets of four taxa: {A, B, C, D} {A, C, D, E} {A, B, D, E}… etc.
• Reduce the larger tree to only hold “one set”
• These are called Quartets


SQS - A Guide

Q(T) is the set of trees induced by T on each set of four leaves.

Let Qw (a different Q) be the set of quartet trees on quartets with diameter less than or equal to w

Find the maximum w for which every quartet tree in Qw also appears in Q(T)

This w is the “support” of that tree


SQS - Rephrased

Qw is the set of quartet trees which have a diameter <= w

Support of T is the max w where Qw is a subset of Q(T)

Support is our “quality measure”

What exactly are we measuring?
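A sketch of the support computation under two assumptions that the slides do not spell out: the quartet trees in Qw are chosen from the distance matrix by the four-point method (the pairing with the smallest sum of distances), and a tree T is described by one side of each internal edge. The matrix, the two trees, and all function names are hypothetical.

```python
from itertools import combinations

def pairings(q):
    """The three ways to split four leaves into two pairs."""
    a, b, c, e = q
    return [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]

def as_quartet(left, right):
    return frozenset([frozenset(left), frozenset(right)])

def four_point(q, d):
    """Quartet tree chosen from distances: the pairing with the smallest sum."""
    left, right = min(
        pairings(q), key=lambda p: d[frozenset(p[0])] + d[frozenset(p[1])])
    return as_quartet(left, right)

def induced(q, splits):
    """Quartet tree that T induces on q, or None if T leaves q unresolved."""
    for left, right in pairings(q):
        for side in splits:
            l_in, r_in = set(left) <= side, set(right) <= side
            l_out, r_out = not set(left) & side, not set(right) & side
            if (l_in and r_out) or (r_in and l_out):
                return as_quartet(left, right)
    return None

def support(leaves, d, splits):
    """Largest distance value w such that every quartet of diameter <= w agrees with T."""
    bad = [max(d[frozenset(p)] for p in combinations(q, 2))
           for q in combinations(leaves, 4)
           if four_point(q, d) != induced(q, splits)]
    values = sorted(set(d.values()))
    if not bad:
        return values[-1]
    good = [w for w in values if w < min(bad)]
    return good[-1] if good else 0

# Hypothetical distances for five taxa a..e.
leaves = "abcde"
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {frozenset([leaves[i], leaves[j]]): matrix[i][j]
     for i, j in combinations(range(5), 2)}

tree_1 = [{"a", "b"}, {"d", "e"}]  # internal edges split off {a, b} and {d, e}
tree_2 = [{"a", "c"}, {"d", "e"}]  # groups a with c instead
print(support(leaves, d, tree_1), support(leaves, d, tree_2))  # prints: 10 9
```

Phase 2 would keep `tree_1`, since its support is larger.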


[Figure: an example Qw, drawn as a set of quartet trees on leaves taken from A, B, C, D, and E.]


SQS Method

Return the tree whose support is the maximum.
• If more than one such tree exists, return the tree found first.
• This is the tree with the smallest original diameter (remember from Phase 1)


How do we know we’re right?

Compare it to the data set we created

Look at Robinson-Foulds accuracy
• Remove one edge in the tree we’ve created. We now have two trees, which split the leaves into two sets.
• Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set?
• If no, add a ‘point’ of error.
• Repeat this for all edges

When the value is not zero, the trees are not identical
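The edge-by-edge check can be sketched by describing each internal edge through the set of leaves on one of its sides. The two small trees and the names `bipartitions` and `rf_errors` are hypothetical; edges to single leaves are left out because they always match.

```python
def bipartitions(sides, leaves):
    """Turn one side of each edge into an unordered {side, other side} pair."""
    leaves = frozenset(leaves)
    return {frozenset([frozenset(s), leaves - frozenset(s)]) for s in sides}

def rf_errors(inferred_sides, true_sides, leaves):
    """Count edges of the inferred tree whose leaf split no edge of the true tree produces."""
    inferred = bipartitions(inferred_sides, leaves)
    true = bipartitions(true_sides, leaves)
    return len(inferred - true)

leaves = "ABCDE"
true_tree = [{"A", "B"}, {"D", "E"}]
same_tree = [{"C", "D", "E"}, {"D", "E"}]   # same edges, described from the other side
other_tree = [{"A", "C"}, {"D", "E"}]

print(rf_errors(same_tree, true_tree, leaves))   # prints: 0
print(rf_errors(other_tree, true_tree, leaves))  # prints: 1
```

Dividing the count by the number of internal edges gives an error rate between 0 and 1.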


Performance of DCM*-NJ

Outperforms the NJ method at sequence lengths above 4000 and with more taxa.

[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ.]


Improvements

Improvement possibilities lie in Phase 2

Include a test of Maximum Parsimony (MP)
• Try to minimize the overall size of the tree

Test using statistical evidence
• Maximum Likelihood (ML)


Performance gains

Simply changing Phase 2 has massive gains in accuracy!

DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and are NOT NP-hard.

DCM-NJ+MP finished its analysis on a 107-taxon tree in under three minutes.


Comparing Improvements

[Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP.]