Phylogeny – data mining by biologists

• Molecular phylogenetics uses clustering techniques to discern relationships between different biological sequences

Understanding our relationships

Trees are like mobiles

The language of trees

Changes can occur

The why and what of natural selection

• Variation exists at the DNA level: alleles

• This variation is inexhaustible (something important to remember when looking at new genome sequences)

• These differences are subjected to selection:

– Changes in protein structure are typically unfavorable and, as a result, selected against

– However, some changes in structure/function are selected for: sickle cell anemia/malaria

Neutral Theory of Evolution - Kimura

• Changes at the third position of a codon, or at a nucleotide in a non-coding, non-regulatory region, are expected to be invisible to natural selection

• Compare Fugu with humans: the most conserved sequences are the genes

– http://www.sciencemag.org/cgi/content/full/297/5585/1301

• Synonymous substitutions and substitutions in pseudogenes (define) are thought to reflect the actual mutation rate operating within a genome (no selection)

• Is this accurate?

Genetic drift

• Random genetic drift is a stochastic process (by definition).

• One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults.

• Begin with equal frequencies of C and T at a given position; if the next generation is observed at 60/40 in favor of C, there is now a greater chance of C making it into the following generation
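A minimal Python sketch of the C/T example, assuming a simple Wright-Fisher style model (a fixed number of allele copies per generation, each drawn at random from the previous generation, no selection). The function name, the population size, and the seed are illustrative assumptions, not part of the slides.

```python
import random

def wright_fisher(freq_c, population_size, generations, seed=None):
    """Track the frequency of allele C under pure random drift (no selection)."""
    rng = random.Random(seed)
    history = [freq_c]
    for _ in range(generations):
        # Each allele copy in the next generation is drawn at random from the
        # current one, so the chance of drawing C equals its current frequency.
        count_c = sum(1 for _ in range(population_size) if rng.random() < freq_c)
        freq_c = count_c / population_size
        history.append(freq_c)
    return history

# Start at 50/50 (C vs. T) and follow the frequency of C for 10 generations.
trajectory = wright_fisher(freq_c=0.5, population_size=20, generations=10, seed=1)
print(trajectory)
```

Once the frequency reaches 0 or 1 it stays there, which is how drift alone can fix or lose an allele.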

Neutralist vs. Selectionist

Where do substitutions occur?

• Non-coding regions exhibit a substitution rate 2X greater than coding regions

• Coding regions are more “functionally constrained”

• Higher degeneracy of codon, higher substitution rate observed

• A thought: Coding sequences – sequence constraint; Non-coding sequence – structure constraint???

Natural variants

• Site-directed mutagenesis studies of a single gene will give way to comparative genomic studies derived from the abundance of sequence data

• As a result, it is important to understand molecular evolution and models describing this process

The relationship between time and substitutions is non-linear

Observing differences in nucleotides

• The simplest measure of distance between two sequences is to count the number of sites where the two sequences differ; expressed as a proportion of the sites compared, this is called the p-distance

• If not all sites are equally likely to change, the same site may undergo repeated substitutions

• As time goes by, the number of differences between two sequences becomes a less and less accurate estimator of the actual number of substitutions that have occurred
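A minimal Python sketch of the p-distance, assuming the two sequences are already aligned, have equal length, and contain no gaps. The function name is illustrative.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

# Two 20-base example sequences that differ at a single site.
print(p_distance("ATCGGCCACTTTCGCGATCA", "ATAGGCCACTTTCGCGATCA"))  # 0.05
```

Because a site that changes twice still counts as at most one difference, this value can only underestimate the true number of substitutions per site.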

So what is phylogenetics good for?

Phylogenetics has direct applications to:

• Conservation: test wood, ivory, meat products for poaching

• Agriculture: analyze specific differences between cultivars

• Forensics: DNA fingerprinting

• Medicine: determine specific biochemical function of cancer-causing genes

Phylogenetic concepts: Interpreting a Phylogeny

[Figure: tree relating Sequence A, Sequence B, Sequence C, Sequence D, and Sequence E, drawn along a time axis]

Which sequence is most closely related to B?

A, because B diverged from A more recently than from any other sequence.

Physical position in tree is not meaningful! Only tree structure matters.

Rooted vs. unrooted

• Root – ancestor of all taxa considered

• Unrooted – relationship without consideration of ancestry

• Often specify root with outgroup

– Outgroup – distantly related species (e.g., mammals and an archaeal species)

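Phylogenetic concepts: Rooted and Unrooted Trees

[Figure: a rooted tree of taxa A, B, C, and D drawn along a time axis, compared with unrooted trees of the same four taxa on which candidate root positions are marked X and ?]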

How Many Trees?

                                    Unrooted trees                              Rooted trees
# sequences  # pairwise distances   # trees                   # branches/tree   # trees                   # branches/tree
3            3                      1                         3                 3                         4
4            6                      3                         5                 15                        6
5            10                     15                        7                 105                       8
6            15                     105                       9                 945                       10
10           45                     2,027,025                 17                34,459,425                18
30           435                    8.69 × 10^36              57                4.95 × 10^38              58
N            N(N-1)/2               (2N-5)!/[2^(N-3) (N-3)!]  2N-3              (2N-3)!/[2^(N-2) (N-2)!]  2N-2
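The tree counts can be checked with a short Python sketch, assuming strictly bifurcating trees and the closed forms (2N-5)!/[2^(N-3) (N-3)!] for unrooted trees and (2N-3)!/[2^(N-2) (N-2)!] for rooted trees. The function names are illustrative.

```python
import math

def count_unrooted_trees(n):
    """Number of distinct unrooted bifurcating trees for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))

def count_rooted_trees(n):
    """Number of distinct rooted bifurcating trees for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return math.factorial(2 * n - 3) // (2 ** (n - 2) * math.factorial(n - 2))

for n in (3, 4, 5, 6, 10):
    pairwise = n * (n - 1) // 2
    print(n, pairwise, count_unrooted_trees(n), count_rooted_trees(n))
```

The rapid growth of these counts is why exhaustive search over all trees is only feasible for small numbers of sequences.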

Tree Types

[Figure: rooted tree of sharks, seahorses, frogs, owls, crocodiles, armadillos, and bats; scale bar = 50 million years]

Evolutionary trees measure time.

[Figure: rooted tree of the same taxa; scale bar = 5% change]

Phylograms measure change.

Tree Properties

Ultrametricity: All tips are an equal distance from the root.

[Figure: rooted tree with tips X and Y and branches labeled a, b, c, d, and e]

a = b + c + d + e

Additivity: Distance between any two tips equals the total branch length between them.

[Figure: rooted tree with tips X and Y and branches labeled a, b, c, d, and e]

XY = a + b + c + d + e

In simple scenarios, evolutionary trees are ultrametric and phylograms are additive.

Tree building

• Get protein/RNA/DNA sequences

• Construct multiple sequence alignment

• Compute pairwise distances (if necessary)

• Build tree – topology and distances

• Estimate reliability

• Visualize

Tree summary

Various models have been generated to more accurately estimate distance and evolution

• All use the following framework:

Probability matrix

pAC is the probability that a site starting with an A has a C at the end of time interval t, etc.

Base composition of sequence; fa = frequency of A

Phylogenetic Methods

Many different procedures exist. Three of the most popular:

Neighbor-joining

• Minimizes distance between nearest neighbors

Maximum parsimony

• Minimizes total evolutionary change

Maximum likelihood

• Maximizes likelihood of observed data

Comparison of Methods

Neighbor-joining

• Uses only pairwise distances

• Minimizes distance between nearest neighbors

• Very fast

• Easily trapped in local optima

• Good for generating tentative tree, or choosing among multiple trees

Maximum parsimony

• Uses only shared derived characters

• Minimizes total distance

• Slow

• Assumptions fail when evolution is rapid

• Best option when tractable (<30 taxa, homoplasy rare)

Maximum likelihood

• Uses all data

• Maximizes tree likelihood given specific parameter values

• Very slow

• Highly dependent on assumed evolution model

• Good for very small data sets and for testing trees built using other methods

Which procedure should we use?

Neighbor-joining? Maximum parsimony? Maximum likelihood?

All that we can!

• Each method has its own strengths

• Use multiple methods for cross-validation

• In some cases, none of the three gives the correct phylogeny!

Jukes-Cantor Model

• Distance between any two sequences is given by: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$

• p is the proportion of nucleotides that are different in the two sequences

• All substitutions are equally probable

– Each off-diagonal position in the matrix = $\alpha$; each diagonal position = $1 - 3\alpha$
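A minimal Python sketch of the Jukes-Cantor correction, assuming p is the observed proportion of differing sites and lies below 0.75 (the point at which the logarithm is no longer defined). The function name is illustrative.

```python
import math

def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p) for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * p)

for p in (0.05, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 4))
```

The corrected distance is always at least as large as p, and the gap widens as sequences diverge, reflecting multiple substitutions at the same site.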

Kimura’s two parameter model

• $d = \frac{1}{2}\ln\left(\frac{1}{1 - 2P - Q}\right) + \frac{1}{4}\ln\left(\frac{1}{1 - 2Q}\right)$

• P and Q are proportional differences between the two sequences due to transitions and transversions, respectively.

• Accounts for transition bias in sequences (transversions are rarer)
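A minimal Python sketch of Kimura's two-parameter distance, assuming aligned, gap-free, uppercase DNA sequences (A, C, G, T only). Transitions are A↔G and C↔T; every other mismatch is counted as a transversion. The function and constant names are illustrative.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def kimura_2p_distance(seq_a, seq_b):
    """K2P distance from the proportions of transitions (P) and transversions (Q)."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq_a, seq_b):
        if x == y:
            continue
        if (x, y) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    p = transitions / len(seq_a)
    q = transversions / len(seq_a)
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences are too divergent for the K2P correction")
    return 0.5 * math.log(1 / (1 - 2 * p - q)) + 0.25 * math.log(1 / (1 - 2 * q))

# One transition (A -> G at the last site) in 20 sites: P = 0.05, Q = 0.
print(kimura_2p_distance("ATCGGCCACTTTCGCGATCA", "ATCGGCCACTTTCGCGATCG"))
```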

Distances in Amino acid sequences

• Account for synonymous and non-synonymous changes in respective codons

• Pathways to double mutations

Dealing with multiple substitutions

• Unweighted method – pathways are equally likely

• Weighted – favor synonymous changes

• Degeneracy classifications

– Nondegenerate (0) – First two positions of TTT (Phe)

– Two-fold degenerate (2) – Third position of TTT (Phe)

– Four-fold degenerate (4) – Third position of GTT (Val)
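The degeneracy classes can be computed from the genetic code. The Python sketch below assumes the standard genetic code (written in the usual T, C, A, G ordering) and, as a labeled simplification, folds three-fold sites such as the third position of the Ile codons into the two-fold class. The names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def degeneracy(codon, position):
    """Return 0, 2, or 4 for the given codon position (0, 1, or 2)."""
    amino_acid = CODON_TABLE[codon]
    # Count the bases at this position (including the original) that keep the amino acid.
    synonymous = sum(
        1
        for base in BASES
        if CODON_TABLE[codon[:position] + base + codon[position + 1:]] == amino_acid
    )
    if synonymous == 1:
        return 0  # every change alters the amino acid
    if synonymous == 4:
        return 4  # every change is synonymous
    return 2      # assumption: 2 or 3 synonymous bases are both treated as two-fold

print([degeneracy("TTT", pos) for pos in range(3)])  # [0, 0, 2]
print(degeneracy("GTT", 2))                          # 4
```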

Evolutionary models

Implementing models and building trees

Comparing models

Trees are hypotheses about evolutionary history

So far, we’ve looked at understanding and formulating these hypotheses. Now, let’s turn our attention to testing them.

Testing the reliability of trees

• Interior branch test or Bootstrap analysis

• Bootstrap analysis – resample subsequences, or delete or replace sequence positions; re-draw the trees; how many times do you get the same branching? Bootstrap values of 70 (95) or greater are normally considered reliable

Tree Testing: Split Decomposition

Split decomposition is one method for testing a tree.

Under this procedure, we choose exactly four taxa (A, B, C, D) and examine the topologies of all possible unrooted trees. How many such trees are there?

[Figure: the three possible unrooted topologies for four taxa: AB|CD, AD|BC, and AC|BD]

Only one of these topologies is right. How can we quantitatively assess the support for each tree?

Tree Testing: Split Decomposition

The correct tree should be approximately additive; the others usually will not. For each tree, we calculate split indices that estimate the length of the internal branch:

[Figure: split index formula, drawn with the AD|BC and AC|BD trees and a divisor of 2; the result estimates the internal branch if AC|BD is the right phylogeny]

• Large split indices: long internal branch; topology strongly supported

• Small split indices: short internal branch; topology weakly supported

• Negative split indices: biologically impossible; topology probably wrong
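One way to put numbers on this idea is sketched below in Python. It is an assumption, not necessarily the exact split index used on the slide: it relies on the four-point condition for additive distances, under which the two "crossing" distance sums exceed the "within-pair" sum by twice the internal branch length. The function name and the example distances are illustrative.

```python
def internal_branch_estimate(dist, pair1, pair2):
    """Estimate the internal branch length for the four-taxon split pair1 | pair2.

    dist maps frozenset({x, y}) to the pairwise distance between taxa x and y.
    """
    (a, b), (c, d) = pair1, pair2
    within = dist[frozenset((a, b))] + dist[frozenset((c, d))]
    crossing = (
        dist[frozenset((a, c))] + dist[frozenset((b, d))]
        + dist[frozenset((a, d))] + dist[frozenset((b, c))]
    ) / 2  # average of the two crossing sums
    return (crossing - within) / 2

# Additive distances from a tree ((A:1, B:2), (C:3, D:4)) with internal branch 5.
example = {
    frozenset("AB"): 3, frozenset("CD"): 7,
    frozenset("AC"): 9, frozenset("AD"): 10,
    frozenset("BC"): 10, frozenset("BD"): 11,
}
print(internal_branch_estimate(example, ("A", "B"), ("C", "D")))  # 5.0
print(internal_branch_estimate(example, ("A", "C"), ("B", "D")))  # -2.5
print(internal_branch_estimate(example, ("A", "D"), ("B", "C")))  # -2.5
```

In this example only the true topology gives a positive estimate; the two wrong topologies give negative values, matching the interpretation that a negative index points to a wrong tree.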

Tree Testing: Bootstrapping

Used to assess the support for individual branches

• Randomly resample characters, with replacement

• Repeat many times (1000 or more)

• How often does a specific branch appear?

[Figure: tree of rat, human, turtle, fruit fly, oak, and duckweed with bootstrap values of 100, 98, and 73 on its branches]
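A minimal Python sketch of the resampling step, assuming the alignment is a list of equal-length strings and that "characters" are alignment columns. The `has_branch` callable is a hypothetical stand-in for "build a tree from this replicate and report whether the branch of interest is present"; here it is replaced by a toy distance comparison so the example runs on its own.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the alignment's shape."""
    n_sites = len(alignment[0])
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in columns) for seq in alignment]

def bootstrap_support(alignment, has_branch, replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which has_branch(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(replicates)
        if has_branch(bootstrap_alignment(alignment, rng))
    )
    return 100 * hits / replicates

def count_differences(s, t):
    return sum(1 for x, y in zip(s, t) if x != y)

def first_two_group_together(replicate):
    # Toy stand-in for tree building: is sequence 1 closer to sequence 2 than to sequence 3?
    a, b, c = replicate
    return count_differences(a, b) < count_differences(a, c)

alignment = [
    "ATCGGCCACTTTCGCGATCA",
    "ATCGGCCACTTTCGCGATCG",
    "ATAGGGCAGTCTCGCGATTA",
]
print(bootstrap_support(alignment, first_two_group_together))
```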

Rates of nucleotide substitutions between human and mouse or rat

• Synonymous rate = 2-10 substitutions per site per 10^9 years in coding regions

• Nonsynonymous rate = 0-3 substitutions per site per 10^9 years in coding regions (more variable among genes)

• Synonymous rate exceeds nonsynonymous rate

Molecular Clocks

• Do homologous proteins evolve at the same substitution rate?

• Estimate relative rates using an outgroup

• But, what about effects of generation time, metabolic specialization, etc?
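A minimal Python sketch of how an outgroup can be used to compare rates, assuming three pairwise distances are already available: between the two ingroup sequences A and B, and from each of them to the outgroup. Under additivity, the amount of change on each lineage since A and B diverged follows from the three distances; equal values are what a clock would predict. The function name and example numbers are hypothetical.

```python
def lineage_branch_lengths(d_ab, d_a_out, d_b_out):
    """Change accumulated on lineage A and on lineage B since their common ancestor."""
    k_a = (d_ab + d_a_out - d_b_out) / 2
    k_b = (d_ab + d_b_out - d_a_out) / 2
    return k_a, k_b

# Hypothetical distances: A is farther from the outgroup than B is.
k_a, k_b = lineage_branch_lengths(d_ab=0.30, d_a_out=0.50, d_b_out=0.40)
print(k_a, k_b)  # lineage A has changed about twice as much as lineage B
```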

Darwin’s theory reinterpreted homology as common ancestry.

Ancestral sequence:

ATCGGCCACTTTCGCGATCA

Successive substitutions along one lineage:

ATAGGCCACTTTCGCGATCA

ATAGGCCACTTTCGCGATTA

ATAGGGCAGTTTCGCGATTA

ATAGGGCAGTTTTGCGATTA

ATAGGGCAGTTTCGCGATTA

ATAGGGCAGTCTCGCGATTA

Successive substitutions along the other lineage:

ATCGGCCACTTTCGCGATCG

ATCGGCCACTTTCGTGATCG

ATCGGCCACGTTCGTGATCG

ATCGGCCACGTTCGCGATCG

ATCGGCCACCTTCGCGATCG

ACCGGCCACCTTCGCGATCG

Homologous sequences:

ACCGGCCACCTTCGCGATCG

ATAGGGCAGTCTCGCGATTA

Orthologs arise by speciation

Sequence in ancestral organism: ATCGGCCACTTTCGCGATCA

Speciation event

Orthologous sequences:

Modern species A: ATAGGGCAGTCTCGCGATTA

Modern species B: ACCGGCCACCTTCGCGATCG

Orthologs are “evolutionary counterparts” – Koonin (2001)

Paralogs arise by duplications

Sequence in ancestral organism: ATCGGCCACTTTCGCGATCA

Duplication event

Paralogous sequences:

Modern duplicate A: ATAGGGCAGTCTCGCGATTA

Modern duplicate B: ACCGGCCACCTTCGCGATCG

Hardison PNAS 2001 98 :1327-1329

We have different types of hemoglobins

The major adult hemoglobin is composed of 2 α chains and 2 β chains. The major fetal hemoglobin is composed of 2 α chains and 2 γ chains.

“There may thus exist a Molecular Evolutionary Clock” – Zuckerkandl & Pauling (1965)

A model of sequence divergence can be used to extract the duplication dates of the different hemoglobin chains

[Figure: tree descending from a primordial hemoglobin through a duplication event and later speciation events to two human and two cow chains]

Note: This model explains why the distance between a human chain and the corresponding cow chain is shorter than the distance between the two human chains.
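A minimal Python sketch of the dating step, assuming a strict molecular clock with a known, constant substitution rate. The factor of 2 reflects that both copies have been accumulating substitutions since they separated. The function name, the distance, and the rate below are hypothetical values chosen only for illustration.

```python
def divergence_time(distance, rate_per_site_per_year):
    """Years since two sequences separated, assuming a constant substitution rate."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    # Both lineages change independently, so the distance grows at twice the rate.
    return distance / (2 * rate_per_site_per_year)

# Hypothetical example: 0.2 substitutions per site at 1e-9 substitutions/site/year.
print(divergence_time(0.2, 1e-9))  # roughly 1e8 years
```

The estimate is only as good as the clock assumption, so differences in rate between lineages or genes translate directly into errors in the date.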

PBS Evolution Library (http://www.pbs.org/wgbh/evolution/library/)

Different clocks keep different times

Between horse and man

The clock varies for different regions of the protein

For example, locations on the exterior of the protein may change at a different rate than those on the interior.

Ayala, F. Bioessays 1999 Jan;21(1):71-5

No universal clocks found!

Two terrible clocks

Ayala, F. Bioessays 1999 Jan;21(1):71-5

The common estimate is 1,100 My

What causes deviations from the clock?

1. Generation time: Shorter generation time will accelerate the clock because it shortens the time to fix new mutations.

2. Mutation rate: Species-characteristic differences in polymerases or other biological properties that affect the fidelity of DNA replication, and hence the incidence of mutations.

3. Gene function: Changes in the function of a protein as evolutionary time proceeds. This might particularly be expected in the case of gene duplication.

4. Natural selection: Organisms are continually adapting to the physical and biotic environments, which change endlessly in patterns that are unpredictable and differently significant to different species.

Ayala, F. Bioessays 1999 Jan;21(1):71-5

HIV Example 1: Florida dentist case

• 1990 case: Did a patient’s HIV infection result from an invasive dental procedure performed by an HIV+ dentist?

• HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).

• Compared viral sequence from the dentist, three of his HIV+ patients, and two HIV+ local controls.

Florida dentist case

So what do the results mean?

• 2 of 3 patients closer to dentist than to local controls. Statistical significance? More powerful analyses?

• Do we have enough data to be confident in our conclusions? What additional data would help?

• If we determine that the dentist’s virus is linked to those of patients E and G, what are possible interpretations of this pattern? How could we test between them?