Phylogeny – data mining by biologists

• Molecular phylogenetics uses clustering techniques to discern relationships among biological sequences

Understanding our relationships

Trees are like mobiles

The language of trees

Changes can occur

The why and what of natural selection

• Variation exists at the DNA level: alleles

• This variation is inexhaustible (something important to remember when looking at new genome sequences)

• These differences are subjected to selection:

– Changes in protein structure are typically unfavorable and, as a result, selected against

– However, some changes in structure/function are selected for: sickle cell anemia/malaria

Neutral Theory of Evolution - Kimura

• Changes at the third position of a codon, or at a nucleotide in a non-coding, non-regulatory region, are expected to be invisible to natural selection

• Compare Fugu with humans: the most conserved sequences are the genes

– http://www.sciencemag.org/cgi/content/full/297/5585/1301

• Synonymous substitutions and substitutions in pseudogenes (nonfunctional gene copies) are thought to reflect the actual mutation rate operating within a genome (no selection)

• Is this accurate?

http://www.sciencemag.org/cgi/content/full/297/5585/1301

Genetic drift

• Random genetic drift is a stochastic process (by definition).

• One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults.

• Example: begin with equal frequencies of C and T at a given position. If the next generation happens to show 60/40 in favor of C, then C has a greater chance of making it into the generation after that
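The C/T example can be played out in a few lines. This is only a sketch: it assumes a Wright-Fisher style model (not named on the slide), in which every allele copy in the next generation is drawn at random from the current pool. The function name, population size, and seed are illustrative choices.

```python
import random

def wright_fisher(freq_c, pop_size, generations, rng):
    """Track the frequency of allele C under pure drift (no selection)."""
    trajectory = [freq_c]
    for _ in range(generations):
        # Each of the pop_size allele copies in the next generation is
        # drawn at random from the current generation's allele pool.
        copies_of_c = sum(rng.random() < freq_c for _ in range(pop_size))
        freq_c = copies_of_c / pop_size
        trajectory.append(freq_c)
    return trajectory

rng = random.Random(1)  # arbitrary fixed seed so the run is repeatable
for generation, freq in enumerate(wright_fisher(0.5, 20, 10, rng)):
    print(f"generation {generation}: C = {freq:.2f}")
```

Once the frequency wanders away from 50/50, the next draw is made from the shifted pool, which is why a chance 60/40 split favors C in the following generation.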

Neutralist vs. Selectionist

Where do substitutions occur?

• Non-coding regions exhibit a substitution rate 2X greater than coding regions

• Coding regions are more “functionally constrained”

• The higher the degeneracy of a codon position, the higher the observed substitution rate

• A thought: Coding sequences – sequence constraint; Non-coding sequence – structure constraint???

Natural variants

• Site-directed mutagenesis studies of a single gene will give way to comparative genomic studies derived from the abundance of sequence data

• As a result, it is important to understand molecular evolution and models describing this process

The relationship between time and substitutions is non-linear

Observing differences in nucleotides

• The simplest measure of distance between two sequences is the proportion of sites at which the two sequences differ – called the p-distance

• If not all sites are equally likely to change, the same site may undergo repeated substitutions

• As time goes by, the number of differences between two sequences becomes a less and less accurate estimator of the actual number of substitutions that have occurred
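A minimal sketch of the p-distance, assuming the two sequences are already aligned, have equal length, and contain no gaps. The function name is illustrative; the two example sequences are the pair shown on the orthologs slide.

```python
def p_distance(seq1, seq2):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)

# Two aligned sequences taken from the orthologs slide.
print(p_distance("ATAGGGCAGTCTCGCGATTA", "ACCGGCCACCTTCGCGATCG"))
```

Because a site that changed twice still counts as at most one difference, this value can only underestimate the number of substitutions as divergence grows.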

So what is phylogenetics good for?

Phylogenetics has direct applications to:

• Conservation: test wood, ivory, meat products for poaching

• Agriculture: analyze specific differences between cultivars

• Forensics: DNA fingerprinting

• Medicine: determine specific biochemical function of cancer-causing genes

Phylogenetic concepts: Interpreting a Phylogeny

[Figure: rooted tree of Sequences A, B, C, D, and E drawn along a time axis]

Which sequence is most closely related to B?

A, because B diverged from A more recently than from any other sequence.

Physical position in tree is not meaningful! Only tree structure matters.

Rooted vs. unrooted

• Root – ancestor of all taxa considered

• Unrooted – relationship without consideration of ancestry

• Often specify the root with an outgroup

– Outgroup – a distantly related species (e.g., mammals and an archaeal species)

Phylogenetic concepts: Rooted and Unrooted Trees

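[Figure: rooted tree of taxa A, B, C, and D drawn along a time axis with the root marked, compared with an unrooted tree of the same taxa in which the root position X is unknown]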

How Many Trees?

# sequences  # pairwise distances  Unrooted: # trees  Unrooted: # branches/tree  Rooted: # trees  Rooted: # branches/tree

3            3                     1                  3                          3                4

4            6                     3                  5                          15               6

5            10                    15                 7                          105              8

6            15                    105                9                          945              10

10           45                    2,027,025          17                         34,459,425       18

30           435                   8.69 × 10^36       57                         4.95 × 10^38     58

For N sequences:

• Pairwise distances: $\frac{N(N-1)}{2}$

• Unrooted trees: $\frac{(2N-5)!}{2^{N-3}\,(N-3)!}$, each with $2N-3$ branches

• Rooted trees: $\frac{(2N-3)!}{2^{N-2}\,(N-2)!}$, each with $2N-2$ branches
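The general-N formulas in this table are easy to check numerically. A small sketch (function names are illustrative) that regenerates the rows:

```python
from math import factorial

def pairwise_distances(n):
    # N(N - 1) / 2
    return n * (n - 1) // 2

def unrooted_trees(n):
    # (2N - 5)! / (2^(N-3) * (N-3)!), defined for N >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

def rooted_trees(n):
    # (2N - 3)! / (2^(N-2) * (N-2)!), defined for N >= 2
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (3, 4, 5, 6, 10, 30):
    print(n, pairwise_distances(n),
          unrooted_trees(n), 2 * n - 3,   # unrooted: trees, branches per tree
          rooted_trees(n), 2 * n - 2)     # rooted: trees, branches per tree
```

The number of trees grows so quickly that exhaustive search over topologies stops being practical well before 30 sequences.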

Tree Types

[Figure: rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, and bats; scale bar: 50 million years]

Evolutionary trees measure time.

[Figure: rooted tree of the same taxa; scale bar: 5% change]

Phylograms measure change.

Tree Properties

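Ultrametricity: All tips are an equal distance from the root.

[Figure: rooted tree with tips X and Y and branches a, b, c, d, e]

a = b + c + d + e

Additivity: Distance between any two tips equals the total branch length between them.

[Figure: rooted tree with tips X and Y joined through branches a, b, c, d, e]

XY = a + b + c + d + e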

In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.
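Both properties can be tested directly on a matrix of tip-to-tip distances. The sketch below uses the standard three-point and four-point conditions, which are not stated on the slides and are brought in here as an assumption. The two small distance sets are made-up examples.

```python
from itertools import combinations

def dist(d, x, y):
    return d[frozenset((x, y))]

def is_ultrametric(d, taxa, tol=1e-9):
    """Three-point condition: in every triple the two largest distances are equal."""
    for x, y, z in combinations(taxa, 3):
        _, middle, largest = sorted([dist(d, x, y), dist(d, x, z), dist(d, y, z)])
        if abs(largest - middle) > tol:
            return False
    return True

def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: in every quartet the two largest pair-sums are equal."""
    for w, x, y, z in combinations(taxa, 4):
        sums = sorted([dist(d, w, x) + dist(d, y, z),
                       dist(d, w, y) + dist(d, x, z),
                       dist(d, w, z) + dist(d, x, y)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

def make_distances(pairs):
    return {frozenset(pair): value for pair, value in pairs.items()}

taxa = ["A", "B", "C", "D"]
# Hypothetical clock-like tree: every tip is 3 units from the root.
clock_like = make_distances({("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 6,
                             ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6})
# Hypothetical phylogram: tip branches 1, 2, 3, 4 and an internal branch of 5.
phylogram = make_distances({("A", "B"): 3, ("C", "D"): 7, ("A", "C"): 9,
                            ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 11})

print(is_ultrametric(clock_like, taxa), is_additive(clock_like, taxa))
print(is_ultrametric(phylogram, taxa), is_additive(phylogram, taxa))
```

The first set passes both checks, while the second is additive but not ultrametric, matching the distinction between evolutionary trees and phylograms.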

Tree building

• Get protein/RNA/DNA sequences

• Construct multiple sequence alignment

• Compute pairwise distances (if necessary)

• Build tree – topology and distances

• Estimate reliability

• Visualize

Tree summary

Various models have been generated to more accurately estimate distance and evolution

• All use the following framework:

Probability matrix

pAC is the probability that a site starting with an A has a C at the end of time interval t, etc.

Base composition of sequence; fa = frequency of A

Phylogenetic Methods

Many different procedures exist. Three of the most popular:

Neighbor-joining

• Minimizes distance between nearest neighbors

Maximum parsimony

• Minimizes total evolutionary change

Maximum likelihood

• Maximizes likelihood of observed data

Comparison of Methods

Neighbor-joining

• Uses only pairwise distances

• Minimizes distance between nearest neighbors

• Very fast

• Easily trapped in local optima

• Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony

• Uses only shared derived characters

• Minimizes total distance

• Slow

• Assumptions fail when evolution is rapid

• Best option when tractable (<30 taxa, homoplasy rare)

Maximum likelihood

• Uses all data

• Maximizes tree likelihood given specific parameter values

• Very slow

• Highly dependent on assumed evolution model

• Good for very small data sets and for testing trees built using other methods

Which procedure should we use?

Neighbor-joining? Maximum parsimony? Maximum likelihood?

All that we can!

• Each method has its own strengths

• Use multiple methods for cross-validation

• In some cases, none of the three gives the correct phylogeny!

Jukes-Cantor Model

• Distance between any two sequences is given by: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$

• p is the proportion of nucleotides that are different in the two sequences

• All substitutions are equally probable

– Each position in matrix = $\alpha$; except diagonal = $1 - 3\alpha$
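A short sketch of the Jukes-Cantor correction, applied to a few illustrative values of p. The function name and the chosen p values are assumptions; the formula is the one given on this slide.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3) for a p-distance p."""
    if not 0 <= p < 0.75:
        # At p = 0.75 the logarithm's argument reaches zero.
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

for p in (0.05, 0.2, 0.4, 0.6):
    print(f"p = {p:.2f}  d = {jukes_cantor(p):.3f}")
```

For small p the corrected distance stays close to p; as p approaches 0.75 the correction grows without bound, reflecting the non-linear relationship between observed differences and actual substitutions.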

Kimura’s two parameter model

• $d = \frac{1}{2}\ln\left[\frac{1}{1-2P-Q}\right] + \frac{1}{4}\ln\left[\frac{1}{1-2Q}\right]$

• P and Q are proportional differences between the two sequences due to transitions and transversions, respectively.

• Accounts for transition bias in sequences (transversions are rarer)
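A sketch of the two-parameter distance, assuming aligned, gap-free, uppercase DNA sequences. Function names are illustrative; the example pair is the one shown on the orthologs slide.

```python
import math

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITION_PAIRS = ({"A", "G"}, {"C", "T"})

def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites differing by a transition / transversion."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        if {a, b} in TRANSITION_PAIRS:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(seq1), transversions / len(seq1)

def kimura_2p(P, Q):
    """d = 1/2 ln[1/(1 - 2P - Q)] + 1/4 ln[1/(1 - 2Q)]"""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("sequences too divergent for the correction to be defined")
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))

P, Q = transition_transversion_proportions("ATAGGGCAGTCTCGCGATTA",
                                           "ACCGGCCACCTTCGCGATCG")
print(f"P = {P:.2f}, Q = {Q:.2f}, d = {kimura_2p(P, Q):.3f}")
```

Splitting the count into P and Q is what lets the model weight the two kinds of change differently.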

Distances in Amino acid sequences

• Account for synonymous and non-synonymous changes in respective codons

• Pathways to double mutations

Dealing with multiple substitutions

• Unweighted method – pathways are equally likely

• Weighted – favor synonymous changes

• Degeneracy classifications

– Nondegenerate (0) – First two positions of TTT (Phe)

– Two-fold degenerate (2) – Third position of TTT (Phe)

– Four fold degenerate (4) – Third position of GTT (Val)

Evolutionary models

Implementing models and building trees

Comparing models

Trees are hypotheses about evolutionary history

So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.

Testing the reliability of trees

• Interior branch test or Bootstrap analysis

• Bootstrap analysis – resample the alignment (subsequences, or sequence deletion or replacement), redraw the trees, and count how many times the same branching appears. Bootstrap values of 70 (95) or greater are normally considered reliable

Tree Testing: Split Decomposition

Split decomposition is one method for testing a tree.

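[Figure: the possible unrooted topologies for four taxa: A and B versus C and D; A and D versus B and C; A and C versus B and D]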

Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. How many such trees are there?

Only one of these topologies is right. How can we quantitatively assess the support for each tree?

Tree Testing: Split Decomposition

The correct tree should be approximately additive; the others usually will not. For each tree, we calculate split indices that estimate the length of the internal branch:

$\frac{(d_{AD} + d_{BC}) - (d_{AC} + d_{BD})}{2}$ = length of the internal branch, if the tree grouping A with C and B with D is the right phylogeny!

Split indices    Internal branch            Topology

Large            Long                       Strongly supported

Small            Short                      Weakly supported

Negative         Biologically impossible    Probably wrong
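A sketch of how split indices might be computed for the three four-taxon topologies. The exact layout of the slide's formula is not fully recoverable, so the code assumes the usual four-point form: half the difference between another topology's pair-sum and the tested topology's own pair-sum. The distances are a made-up additive example (tip branches 1, 2, 3, 4 and an internal branch of 5 separating A, B from C, D).

```python
def split_indices(d):
    """d maps pair names ("AB", "AC", ...) to pairwise distances.

    Returns, for each topology, the two internal-branch estimates obtained
    by comparing its pair-sum with each of the other two pair-sums.
    """
    sums = {
        "AB|CD": d["AB"] + d["CD"],
        "AC|BD": d["AC"] + d["BD"],
        "AD|BC": d["AD"] + d["BC"],
    }
    indices = {}
    for topology, own_sum in sums.items():
        indices[topology] = sorted(
            (other_sum - own_sum) / 2
            for other, other_sum in sums.items()
            if other != topology
        )
    return indices

# Hypothetical additive distances for a tree that groups A with B and C with D.
distances = {"AB": 3, "CD": 7, "AC": 9, "AD": 10, "BC": 10, "BD": 11}
for topology, values in split_indices(distances).items():
    print(topology, values)
```

Only the topology that matches the underlying tree gets two positive indices, both equal to the internal branch length; the others produce zero or negative values.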

Tree Testing: Bootstrapping

Used to assess the support for individual branches

Randomly resample characters, with replacement

How often does a specific branch appear?

Repeat many times (1000 or more)

[Figure: bootstrapped tree of rat, human, turtle, fruit fly, oak, and duckweed, with branch support values of 100, 98, and 73]
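The resampling step can be sketched for the simplest case of four taxa. Instead of a full tree-building method, this sketch makes a simplifying assumption: each replicate picks the topology whose two paired p-distances sum to the least. The four sequences are reused from the homology slide purely as stand-ins, and all names are illustrative.

```python
import random
from collections import Counter

TOPOLOGIES = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

def p_distance(seq1, seq2):
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def best_topology(alignment):
    """Topology with the smallest pair-sum (ties go to the first listed)."""
    def pair_sum(name):
        (w, x), (y, z) = TOPOLOGIES[name]
        return (p_distance(alignment[w], alignment[x])
                + p_distance(alignment[y], alignment[z]))
    return min(TOPOLOGIES, key=pair_sum)

def bootstrap_support(alignment, replicates, rng):
    """Percentage of replicates in which each topology is recovered."""
    length = len(next(iter(alignment.values())))
    counts = Counter()
    for _ in range(replicates):
        # Resample alignment columns (characters) with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        resampled = {name: "".join(seq[i] for i in columns)
                     for name, seq in alignment.items()}
        counts[best_topology(resampled)] += 1
    return {name: 100 * counts[name] / replicates for name in TOPOLOGIES}

alignment = {
    "A": "ATAGGGCAGTCTCGCGATTA",
    "B": "ATAGGGCAGTTTCGCGATTA",
    "C": "ACCGGCCACCTTCGCGATCG",
    "D": "ATCGGCCACCTTCGCGATCG",
}
print(bootstrap_support(alignment, replicates=1000, rng=random.Random(0)))
```

With more than four taxa the same loop applies, but each replicate would need a real tree-building step and the count would be kept per branch rather than per topology.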

Rates of nucleotide substitutions between human and mouse or rat

• Synonymous rate = 2-10 substitutions per site per $10^9$ years in coding regions

• Nonsynonymous rate = 0-3 substitutions per site per $10^9$ years in coding regions (more variable among genes)

• Synonymous rate exceeds nonsynonymous rate

Molecular Clocks

• Do homologous proteins evolve at the same substitution rate?

• Estimate relative rates using an outgroup

• But, what about effects of generation time, metabolic specialization, etc?

Darwin’s theory reinterpreted homology as common ancestry.

ATCGGCCACTTTCGCGATCA

ATAGGCCACTTTCGCGATCA

ATAGGCCACTTTCGCGATTA

ATAGGGCAGTTTCGCGATTA

ATAGGGCAGTTTTGCGATTA

ATAGGGCAGTTTCGCGATTA

ATAGGGCAGTCTCGCGATTA

ATCGGCCACTTTCGCGATCG

ATCGGCCACTTTCGTGATCG

ATCGGCCACGTTCGTGATCG

ATCGGCCACGTTCGCGATCG

ATCGGCCACCTTCGCGATCG

ACCGGCCACCTTCGCGATCG

ACCGGCCACCTTCGCGATCG ATAGGGCAGTCTCGCGATTA

Ancestral sequence

Homologous sequences

Orthologs arise by speciation


ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

Sequence in ancestral organism

Orthologous sequences

Speciation event

Modern species A Modern species B

Orthologs are “evolutionary counterparts” – Koonin (2001)

Paralogs arise by duplications


ATAGGGCAGTCTCGCGATTA ACCGGCCACCTTCGCGATCG

Sequence in ancestral organism

Paralogous sequences

Duplication event

Modern duplicate A Modern duplicate B

Hardison, PNAS 2001; 98:1327-1329

We have different types of hemoglobins

The major adult hemoglobin is composed of 2 α chains and 2 β chains. The major fetal hemoglobin is composed of 2 α chains and 2 γ chains.

http://fai.unne.edu.ar/biologia/macromoleculas/figacro/hemoglobin.jpg

“There may thus exist a Molecular Evolutionary Clock” – Zuckerkandl & Pauling (1965)

A model of sequence divergence can be used to extract the duplication dates of the different hemoglobin chains

Duplication event

Primordial hemoglobin

Human α, Human β, Cow α, Cow β

Speciation event

Note: This model explains why the distance between human α and cow α is shorter than the distance between human α and human β.

PBS Evolution Library (http://www.pbs.org/wgbh/evolution/library/)

Different clocks keep different times

Between horse and man

The clock varies for different regions of the protein

For example, locations on the exterior of the protein may change at a different rate than those on the interior.

Ayala, F. Bioessays 1999 Jan;21(1):71-5

No universal clocks found!

Two terrible clocks


The common estimate is 1,100 My

What causes deviations from the clock?

1. Generation time: Shorter generation time will accelerate the clock because it shortens the time to fix new mutations.

2. Mutation rate: Species-characteristic differences in polymerases or other biological properties that affect the fidelity of DNA replication, and hence the incidence of mutations.

3. Gene function: Changes in the function of a protein as evolutionary time proceeds. This might particularly be expected in the case of gene duplication.

4. Natural selection: Organisms are continually adapting to the physical and biotic environments, which change endlessly in patterns that are unpredictable and differently significant to different species.


HIV Example 1: Florida dentist case

• 1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?

• HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).

• Compared viral sequence from the dentist, three of his HIV+ patients, and two HIV+ local controls.

Florida dentist case

So what do the results mean?

• 2 of 3 patients closer to dentist than to local controls. Statistical significance? More powerful analyses?

• Do we have enough data to be confident in our conclusions? What additional data would help?

• If we determine that the dentist’s virus is linked to those of patients E and G, what are possible interpretations of this pattern? How could we test between them?
