Lecture 24


Page 1: Lecture 24

Bioinformatics

• Inferring molecular phylogeny

• Distance methods

• Discrete methods

• Comparisons of different tree building methods

• Estimating sampling error: the bootstrap

Lecture 24

Page 2: Lecture 24

Inferring molecular phylogeny

• The objective of molecular phylogenetics is to convert sequence information (DNA, RNA, proteins) into an evolutionary tree for these sequences.

• The ever-growing number of tree-building methods can be divided, very roughly, in two ways:

• Distance methods versus discrete-character methods.

• Clustering methods versus search methods.

• These methods will be considered during the lecture.

Page 3: Lecture 24

Distance methods

• The simplest distance method is UPGMA (Unweighted Pair Group Method with Arithmetic Mean). It assumes constant substitution rates and approximately equal lengths of neighboring branches.

• As a first step, a distance matrix must be built, giving the distances between all possible pairs of sequences used for the phylogenetic reconstruction.

• The UPGMA starts from calculating branch length

Page 4: Lecture 24

Distance methods: an idealised case

A. Sequences

Sequence A  ACGCGTTGGGCGATGGCAAC
Sequence B  ACGCGTTGGGCGACGGTAAT
Sequence C  ACGCATTGAATGATGATAAT
Sequence D  ACACATTGAGTGATAATAAT

B. Distances between sequences

nAB = 3, nAC = 7, nAD = 8, nBC = 6, nBD = 7, nCD = 3

C. Distance table

OTU  A  B  C  D
A    -  3  7  8
B    -  -  6  7
C    -  -  -  3
D    -  -  -  -
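The counts in part B can be reproduced by comparing the aligned sequences site by site. The following is a minimal sketch, assuming the fourth sequence on the slide is Sequence D; the function name `count_differences` is illustrative and not part of the lecture.

```python
from itertools import combinations

# Aligned sequences from part A (the fourth one is taken to be Sequence D).
sequences = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}


def count_differences(s1, s2):
    """Number of aligned sites at which two sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(s1, s2))


# One entry per unordered pair of OTUs, keyed "AB", "AC", ...
distances = {
    a + b: count_differences(sequences[a], sequences[b])
    for a, b in combinations(sorted(sequences), 2)
}

for pair, n in distances.items():
    print(f"n{pair} = {n}")
```

These are raw difference counts; no correction for multiple substitutions at the same site is applied.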

D. The assumed unrooted tree

[Figure: unrooted tree with A and B on one side and C and D on the other; branch lengths shown: 1, 1, 2, 2 and 4.]
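The branch lengths of the assumed tree can be recovered from the distance table alone. The sketch below assumes that the tree pairs A with B and C with D, and that the distances are exactly additive (each distance equals the sum of the branches on the path between the two OTUs); the variable names are illustrative.

```python
# Pairwise distances from the distance table.
d = {"AB": 3, "AC": 7, "AD": 8, "BC": 6, "BD": 7, "CD": 3}

# For a leaf x with sister y and any leaf z on the other side:
#   length(x) = (d_xy + d_xz - d_yz) / 2
a = (d["AB"] + d["AC"] - d["BC"]) / 2
b = d["AB"] - a
c = (d["CD"] + d["AC"] - d["AD"]) / 2
e = d["CD"] - c  # branch leading to D
internal = d["AC"] - a - c

branch = {"A": a, "B": b, "C": c, "D": e}
side = {"A": 0, "B": 0, "C": 1, "D": 1}  # which end of the internal branch


def tree_distance(x, y):
    """Path length between two leaves on the assumed tree."""
    crosses = side[x] != side[y]
    return branch[x] + branch[y] + (internal if crosses else 0)


# Check that the tree reproduces every entry of the distance table.
fits = all(tree_distance(p[0], p[1]) == v for p, v in d.items())

print(branch, "internal:", internal, "fits table:", fits)
```

Under these assumptions the branch lengths come out as A = 2, B = 1, C = 1, D = 2, with an internal branch of 4, which matches the set of values shown on the figure.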

Page 5: Lecture 24

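Diagram illustrating the stepwise construction of a phylogenetic tree for four OTUs according to the unweighted pair group method with arithmetic mean (UPGMA). The resulting tree is ultrametric. Methods used: distance and clustering.

Initial distance table:

OTU  B   C   D
A    14  11  7
B    -   11  13
C    -   -   8

Step 1. A and D are the closest pair (dAD = 7). They are joined, with branch point at dAD/2 = 3.5.

(AD)B = (AB + DB)/2 = 13.5
(AD)C = (AC + DC)/2 = 9.5

OTU   B     C
(AD)  13.5  9.5
B     -     11

Step 2. (AD) and C are now the closest pair. C is joined to (AD), with branch point at d(AD)C/2 = 4.75.

(ADC)B = (AB + DB + CB)/3 = 12.67

OTU    B
(ADC)  12.67

Step 3. B is joined to (ADC), with branch point at d(ADC)B/2 = 6.33.

Values for these tables are calculated from the data presented in the initial table.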
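The stepwise clustering on this slide can be sketched in a few lines. This is an illustrative implementation, not the lecture's own code: the function name `upgma` and the data layout are assumptions, and the distance between two clusters is taken as the arithmetic mean over all pairs of their members, as in the formulas above.

```python
from itertools import combinations


def upgma(dist):
    """Cluster OTUs by UPGMA.

    dist maps frozenset({x, y}) to the distance between OTUs x and y.
    Returns the merge steps as (cluster_1, cluster_2, height), where
    height is half the average distance between the merged clusters.
    """
    leaves = set().union(*dist)
    clusters = [(name,) for name in sorted(leaves)]

    def average(c1, c2):
        # Arithmetic mean over all leaf pairs drawn from the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the smallest average distance.
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average(*p))
        merges.append((c1, c2, average(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Initial distance table from the slide.
pairs = {"AB": 14, "AC": 11, "AD": 7, "BC": 11, "BD": 13, "CD": 8}
dist = {frozenset(k): v for k, v in pairs.items()}

merges = upgma(dist)
for c1, c2, height in merges:
    print("join", "".join(c1), "+", "".join(c2), "at", round(height, 2))
```

With the slide's table this joins A with D, then C, then B, at the same heights as the diagram.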

Page 6: Lecture 24

Neighbour-joining tree construction. Methods: distance and clustering.

OTU  H      C     G     O
C    1.45*  -     -     -
G    1.51   1.57  -     -
O    2.98   2.94  3.04  -
R    7.51   7.55  7.39  7.10

H – Human

C – Chimpanzee

G – Gorilla

O – Orangutan

R – Rhesus monkey

* Number of nucleotide substitutions per 100 sites between OTUs.

Page 7: Lecture 24

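Neighbour-relation scores obtained from the distance matrix (see previous slide)

Calculation of the total scores:

For each set of four OTUs, the three possible sums of two pairwise distances are compared. For H, C, G and O, (dHG + dCO) is the minimum sum, so

each of the pairs (HG) and (CO) is assigned a score of 1; the other pairs score 0.

Summing the scores over all sets of four OTUs gives the total scores, which are shown in the table.

(OR) has the highest total score.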
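The scoring rule can be sketched as follows, using the distance matrix from the previous slide. This follows the neighbour-relation scoring described here, not the Q-matrix criterion used by other neighbour-joining implementations; the helper `d` and the variable names are illustrative.

```python
from itertools import combinations

# Substitutions per 100 sites between OTUs (distance matrix from the previous slide).
pairs = {
    "HC": 1.45, "HG": 1.51, "CG": 1.57,
    "HO": 2.98, "CO": 2.94, "GO": 3.04,
    "HR": 7.51, "CR": 7.55, "GR": 7.39, "OR": 7.10,
}
dist = {frozenset(k): v for k, v in pairs.items()}


def d(x, y):
    """Distance between two OTUs, independent of argument order."""
    return dist[frozenset((x, y))]


scores = {pair: 0 for pair in dist}
for w, x, y, z in combinations("HCGOR", 4):
    # The three ways to split four OTUs into two pairs.
    splits = [((w, x), (y, z)), ((w, y), (x, z)), ((w, z), (x, y))]
    best = min(splits, key=lambda s: d(*s[0]) + d(*s[1]))
    # Both pairs of the split with the minimum sum score 1.
    for p in best:
        scores[frozenset(p)] += 1

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print("".join(sorted(pair)), score)
```

With five OTUs there are five sets of four, so the scores add up to 10 in total.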

Page 8: Lecture 24

Building the neighbour-joining (NJ) tree

Treating (OR), which has the highest total score, as a single OTU, the following table can be calculated.

OTU   H     C     G
C     1.45  -     -
G     1.51  1.57  -
(OR)  5.25  5.25  5.22

As only 4 OTUs are left, it is easy to see that

dHC + dG(OR) = 6.67 < dHG + dC(OR) = 6.76 < dH(OR) + dCG = 6.82

Therefore, H and C are chosen as one pair of neighbours and G and (OR) as the other.
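The reduced table and the final comparison can be checked with a short sketch. It assumes that the distance from an OTU to the composite OTU (OR) is the arithmetic mean of its distances to O and to R, which reproduces the table values to within rounding; the dictionary names are illustrative.

```python
# Distances from the original matrix that are needed for this step.
pairs = {
    "HC": 1.45, "HG": 1.51, "CG": 1.57,
    "HO": 2.98, "CO": 2.94, "GO": 3.04,
    "HR": 7.51, "CR": 7.55, "GR": 7.39,
}
dist = {frozenset(k): v for k, v in pairs.items()}


def d(x, y):
    """Distance between two OTUs, independent of argument order."""
    return dist[frozenset((x, y))]


# Assumed rule: distance to (OR) is the mean of the distances to O and R.
to_or = {x: (d(x, "O") + d(x, "R")) / 2 for x in "HCG"}

# The three ways to pair up H, C, G and (OR).
sums = {
    "HC | G(OR)": d("H", "C") + to_or["G"],
    "HG | C(OR)": d("H", "G") + to_or["C"],
    "H(OR) | CG": to_or["H"] + d("C", "G"),
}
best = min(sums, key=sums.get)

for name, total in sorted(sums.items(), key=lambda kv: kv[1]):
    print(f"{name}: {total:.3f}")
print("chosen pairing:", best)
```

The sums are computed from unrounded averages, so they can differ from the slide's figures in the third decimal place; the ordering of the three sums is the same.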

Page 9: Lecture 24

Maximum parsimony

Methods: discrete characters and search/optimisation

Informative sites (*) in four compared sequences, used for phylogenetic reconstruction.

Sequence   Site
           1  2  3  4  5  6  7  8  9
1          A  A  G  A  G  T  G  C  A
2          A  G  C  C  G  T  G  C  G
3          A  G  A  T  A  T  C  C  A
4          A  G  A  G  A  T  C  C  G
Inf. sites             *     *     *
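Which sites count as informative can be checked mechanically. The sketch below uses the usual parsimony criterion, stated here as an assumption because the slide does not spell it out: a site is informative if at least two different nucleotides each occur in at least two sequences. The function name `is_informative` is illustrative.

```python
from collections import Counter

# The four sequences from the table, sites 1 to 9.
sequences = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}


def is_informative(column):
    """True if at least two different states each occur at least twice."""
    counts = Counter(column)
    return sum(n >= 2 for n in counts.values()) >= 2


# Transpose the alignment so that each tuple holds one site.
columns = list(zip(*sequences.values()))
informative = [i for i, col in enumerate(columns, start=1) if is_informative(col)]

print("informative sites:", informative)
```

Under this criterion sites 5, 7 and 9 are informative, giving the three asterisks in the table.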

Page 10: Lecture 24

Three possible unrooted trees (I, II and III) for four DNA sequences (1, 2, 3, 4) that have been used to choose the most parsimonious tree.
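The choice among the three trees can be illustrated by counting the minimum number of substitutions each topology requires over all nine sites. The sketch below names each tree by its grouping, since the transcript does not record which grouping carries which label (I, II or III). The brute-force `site_cost` helper is an illustrative shortcut that only works for four sequences.

```python
from itertools import product

# The four sequences from the previous slide, sites 1 to 9.
sequences = {
    1: "AAGAGTGCA",
    2: "AGCCGTGCG",
    3: "AGATATCCA",
    4: "AGAGATCCG",
}

# The three unrooted topologies for four sequences, named by their groupings.
topologies = {
    "(1,2),(3,4)": ((1, 2), (3, 4)),
    "(1,3),(2,4)": ((1, 3), (2, 4)),
    "(1,4),(2,3)": ((1, 4), (2, 3)),
}


def site_cost(left, right):
    """Minimum substitutions at one site on a four-taxon unrooted tree.

    left and right hold the bases of the two leaves on either side of the
    internal branch. Every assignment of bases to the two internal nodes
    is tried and the cheapest one is kept.
    """
    return min(
        sum(b != x for b in left) + (x != y) + sum(b != y for b in right)
        for x, y in product("ACGT", repeat=2)
    )


def tree_length(topology):
    """Total number of substitutions the topology requires over all sites."""
    (a, b), (c, e) = topology
    n_sites = len(sequences[a])
    return sum(
        site_cost(
            (sequences[a][i], sequences[b][i]),
            (sequences[c][i], sequences[e][i]),
        )
        for i in range(n_sites)
    )


lengths = {name: tree_length(t) for name, t in topologies.items()}
best = min(lengths, key=lengths.get)

print(lengths)
print("most parsimonious:", best)
```

The three totals differ only because of the informative sites; every other site costs the same on all three trees.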

Page 11: Lecture 24

Comparison of different tree-building methods

• Efficiency (how fast is the method?)

• Power (how much data does the method need to produce a reasonable result?)

• Consistency (will it converge on the right answer given enough data?)

• Robustness (will minor violations of the method’s assumptions result in poor estimates of phylogeny?)

• Falsifiability (will the method tell us when its assumptions are violated, so that we can avoid using it?)

Page 12: Lecture 24

Performance of UPGMA and parsimony methods

[Figure panels: UPGMA, Parsimony]

The success rate is the percentage of times that the correct tree was recovered in that region of the parameter space. The white area at the top left of both diagrams is where neither method performs well.

Page 13: Lecture 24
Page 14: Lecture 24

MEGA 3

Page 15: Lecture 24

MEGA 3: Sequence Data Explorer

Variable sites

Parsimonious sites

Sequences continue

Page 16: Lecture 24

MEGA 3: phylogenetic trees

[Figure panels: Neighbor-joining (NJ), Minimum evolution (ME), Maximum parsimony (MP), UPGMA]

Page 17: Lecture 24

Bootstrapping

[Figure panels: NJ, ME, MP, UPGMA]
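The idea behind the bootstrap can be sketched with the four sequences of the idealised distance example: alignment columns are resampled with replacement, a tree is inferred from each pseudo-alignment, and the fraction of replicates that recover a grouping is its bootstrap support. This is a toy illustration under stated assumptions, not the procedure MEGA uses. The tree is chosen by the minimum sum of within-pair differences, ties are counted as unresolved, and all function names and the `replicates` and `seed` parameters are hypothetical.

```python
import random
from collections import Counter

# Aligned sequences from the idealised distance example (fourth taken to be D).
sequences = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}

AB_CD = (("A", "B"), ("C", "D"))
AC_BD = (("A", "C"), ("B", "D"))
AD_BC = (("A", "D"), ("B", "C"))


def differences(s1, s2):
    """Number of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(s1, s2))


def best_split(seqs):
    """Split with the smallest sum of within-pair differences, or None if tied."""
    totals = {
        split: sum(differences(seqs[x], seqs[y]) for x, y in split)
        for split in (AB_CD, AC_BD, AD_BC)
    }
    low = min(totals.values())
    winners = [s for s, t in totals.items() if t == low]
    return winners[0] if len(winners) == 1 else None


def bootstrap_support(seqs, replicates=1000, seed=0):
    """Count how often each split is recovered from resampled alignments."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    support = Counter()
    for _ in range(replicates):
        # Resample columns with replacement, keeping the original length.
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(s[i] for i in cols) for name, s in seqs.items()}
        support[best_split(resampled)] += 1
    return support


support = bootstrap_support(sequences)
for split, count in support.most_common():
    print(split, count / sum(support.values()))
```

In this idealised data set no site favours either alternative grouping, so each replicate either recovers the A-B / C-D split or is unresolved; real data give lower and more informative support values.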