Maximum likelihood (ML) method

Maximum likelihood (ML) method Jarno Tuimala Thanks to James McInerney for the slides with a darker background!

Transcript of Maximum likelihood (ML) method

Page 1: Maximum likelihood (ML) method

Maximum likelihood (ML) method

Jarno Tuimala

Thanks to James McInerney for the slides with a darker background!

Page 2: Maximum likelihood (ML) method

Maximum likelihood

• Historically a fairly new method (Felsenstein, 1980s)

• ML assumes a model of sequence evolution

• Using the model, the ML method tries to answer the question: what is the likelihood (conditional probability) of observing these data, given a certain model?

Page 3: Maximum likelihood (ML) method

Maximum Likelihood - goal

• To estimate the probability that we would observe a particular dataset, given a phylogenetic tree and some notion of how the evolutionary process worked over time.

Probability of [the dataset] given [a tree and a substitution matrix over the nucleotides a, c, g, t]

Matrix shown on the slide:

a b c d
b a e f
c e a g
d c f a

Page 4: Maximum likelihood (ML) method

Probability of observing a sequence

• What is the probability of observing a sequence ACGT, if

– p(a)=p(c)=p(g)=p(t)=0.25?

– Assumption: sequence sites evolve independently

• P(ACGT) = p(a)*p(c)*p(g)*p(t)

= 0.25*0.25*0.25*0.25

= 0.00390625

• LogP = log(0.00390625) = -5.545177 (natural logarithm)
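The calculation above can be sketched in a few lines of Python. This is only an illustration of the independence assumption (multiply per-site probabilities, or equivalently add their logarithms); the function name `sequence_log_probability` is made up for this example.

```python
import math

def sequence_log_probability(sequence, base_freqs):
    """Return ln P(sequence), assuming every site is independent.

    Under independence the probability is a product over sites,
    so the log-probability is a sum of per-site logs.
    """
    return sum(math.log(base_freqs[base]) for base in sequence.lower())

# Equal base frequencies, as on the slide.
freqs = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

log_p = sequence_log_probability("ACGT", freqs)
print(log_p)            # natural log of the probability, about -5.545
print(math.exp(log_p))  # the probability itself, about 0.0039
```

Working with logarithms is the usual choice for longer sequences, because the raw product quickly becomes too small to represent accurately.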

Page 5: Maximum likelihood (ML) method

Substitution matrix

• For nucleotide sequences, there are 16 possible ways to describe substitutions - a 4x4 matrix.

P =

a b c d
e f g h
i j k l
m n o p

Note: for amino acids, the matrix is a 20 x 20 matrix and for codon-based models, the matrix is 61 x 61

Convention dictates that the order of the nucleotides is a,c,g,t

Page 6: Maximum likelihood (ML) method

Does changing a model affect the outcome?

There are different models:

Jukes and Cantor (JC69): All base compositions equal (0.25 each), rate of change from one base to another is the same.

Kimura 2-Parameter (K2P): All base compositions equal (0.25 each), different substitution rates for transitions and transversions.

Hasegawa-Kishino-Yano (HKY): Like the K2P, but with base composition free to vary.

General Time Reversible (GTR): Base composition free to vary, all possible substitutions can differ.

All these models can be extended to accommodate invariable sites and site-to-site rate variation.
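To make the relationship between the first three models concrete, here is a minimal sketch that builds an unnormalized instantaneous rate matrix. It assumes the common parameterization in which the rate from base x to base y is the frequency of y, multiplied by a transition/transversion ratio `kappa` when the change is a transition. The function name `rate_matrix`, the value `kappa=2.0`, and the unequal frequencies are illustrative assumptions, not values from the slides. GTR is left out because it needs six separate exchange rates.

```python
BASES = "acgt"  # conventional nucleotide order
# Transitions are purine<->purine (a<->g) and pyrimidine<->pyrimidine (c<->t).
TRANSITIONS = {("a", "g"), ("g", "a"), ("c", "t"), ("t", "c")}

def rate_matrix(base_freqs, kappa=1.0):
    """Build an unnormalized HKY-style rate matrix as a list of rows.

    Off-diagonal entry [x][y] is kappa * freq[y] for transitions and
    freq[y] for transversions; each diagonal entry is set so that its
    row sums to zero.
    """
    matrix = []
    for i, x in enumerate(BASES):
        row = []
        for j, y in enumerate(BASES):
            if i == j:
                row.append(0.0)  # placeholder, filled in below
            else:
                rate = kappa if (x, y) in TRANSITIONS else 1.0
                row.append(rate * base_freqs[y])
        row[i] = -sum(row)
        matrix.append(row)
    return matrix

equal = {base: 0.25 for base in BASES}
unequal = {"a": 0.1, "c": 0.2, "g": 0.3, "t": 0.4}  # hypothetical frequencies

jc69 = rate_matrix(equal, kappa=1.0)    # one rate for every change
k2p = rate_matrix(equal, kappa=2.0)     # transitions differ from transversions
hky = rate_matrix(unequal, kappa=2.0)   # K2P plus unequal base frequencies
```

Seen this way, JC69 is the special case of K2P with `kappa=1.0`, and K2P is the special case of HKY with equal base frequencies.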

Page 7: Maximum likelihood (ML) method

Probability of observing a sequence change 1/2

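• Alignment:

ACCT
GCCT

• Change probabilities (Jukes-Cantor, μ=0.1): 0.9815 for a base staying the same, 0.0062 for each specific change (the values used in the calculation on the next page)

• Tree: ACCT – GCCT

• Nucleotide frequencies: p(a)=p(c)=p(g)=p(t)=0.25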

Page 8: Maximum likelihood (ML) method

Probability of observing a sequence change 2/2

• P(ACCT, GCCT) = ∏ over sites i (frequency * change probability)

• P(ACCT, GCCT) = 0.25*0.0062 * 0.25*0.9815 * 0.25*0.9815 * 0.25*0.9815 = 0.00002289932

• Log10(P(ACCT, GCCT)) = -4.64 (base-10 logarithm; the natural logarithm is about -10.68)
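The product over sites can be reproduced with a short sketch. It uses the two change probabilities that appear in the calculation (0.9815 for no change, 0.0062 for a specific change) and equal base frequencies; the function name `pair_likelihood` is invented for this illustration.

```python
import math

P_SAME = 0.9815    # probability that a base stays the same
P_CHANGE = 0.0062  # probability of one specific change
FREQ = 0.25        # equal base frequencies

def pair_likelihood(seq1, seq2):
    """Multiply (frequency * change probability) over all aligned sites."""
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        change_prob = P_SAME if x == y else P_CHANGE
        likelihood *= FREQ * change_prob
    return likelihood

lik = pair_likelihood("ACCT", "GCCT")
print(lik)              # about 2.29e-05
print(math.log10(lik))  # about -4.64
print(math.log(lik))    # about -10.68
```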

Page 9: Maximum likelihood (ML) method

Different Branch Lengths

• For very short branch lengths, the probability of a character staying the same is high and the probability of it changing is low (for our particular matrix).

• For longer branch lengths, the probability of character change becomes higher and the probability of staying the same is lower.

• The previous calculations are based on the assumption that the branch length describes one Certain Evolutionary Distance or CED.

• If we want to consider a branch length that is twice as long (2 CED), then we can multiply the substitution matrix by itself (matrix squared): P x P = P^2
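As a rough sketch of the matrix-squaring idea, the code below builds a 4x4 matrix from the two change probabilities used earlier in the slides (0.9815 and 0.0062) and multiplies it by itself. The helper names `mat_mul` and `transition_matrix` are made up for this example, and plain lists are used so that no extra library is needed.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(p_same, p_change):
    """4x4 matrix with p_same on the diagonal and p_change elsewhere."""
    return [[p_same if i == j else p_change for j in range(4)]
            for i in range(4)]

one_ced = transition_matrix(0.9815, 0.0062)  # branch length of 1 CED
two_ced = mat_mul(one_ced, one_ced)          # branch length of 2 CED

print(two_ced[0][0])  # probability of no net change, lower than 0.9815
print(two_ced[0][1])  # probability of a specific change, higher than 0.0062
```

The squared matrix shows the behaviour described in the bullets: on the longer branch the diagonal entries shrink and the off-diagonal entries grow.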

Page 10: Maximum likelihood (ML) method

Optimizing the branch lengths

Page 11: Maximum likelihood (ML) method

Invariable sites

• For a given dataset we can assume that a certain proportion of sites are not free to vary: purifying selection (related to function) prevents these sites from changing.

• We can therefore observe invariable positions either because they are under this selective constraint, because they have not had a chance to vary, or because there is homoplasy in the dataset and a reversal (say) has caused the site to appear constant.

• The likelihood that a site is invariable can be calculated by incorporating this possibility into our model and calculating for every site the likelihood that it is an invariable site.

• It might improve the likelihood of the dataset if we remove a certain proportion of invariable sites (in a way that is analogous to the preceding discussion).
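One common way to incorporate this possibility is a two-part mixture: with probability `p_inv` a site can never change, and otherwise it evolves under the ordinary substitution probabilities. The sketch below assumes that formulation. The value `p_inv = 0.3` is hypothetical, the function name is invented for the example, and the change probabilities are the ones used earlier in the slides.

```python
def site_likelihood_with_invariants(x, y, p_inv, freq, p_same, p_change):
    """Likelihood of one aligned site (x, y) under an invariable-sites mixture.

    An invariable site can only show identical bases, so it contributes
    `freq` when x == y and nothing when the bases differ.
    """
    variable_part = freq * (p_same if x == y else p_change)
    invariable_part = freq if x == y else 0.0
    return p_inv * invariable_part + (1.0 - p_inv) * variable_part

def alignment_likelihood(seq1, seq2, p_inv):
    """Product of site likelihoods, assuming independent sites."""
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        likelihood *= site_likelihood_with_invariants(
            x, y, p_inv, freq=0.25, p_same=0.9815, p_change=0.0062)
    return likelihood

without_inv = alignment_likelihood("ACCT", "GCCT", p_inv=0.0)
with_inv = alignment_likelihood("ACCT", "GCCT", p_inv=0.3)  # hypothetical p_inv
print(without_inv, with_inv)
```

With `p_inv=0.0` the function reduces to the plain calculation, which is a quick check that the mixture is wired up correctly.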

Page 12: Maximum likelihood (ML) method

Variable sites

• Obviously other sites in the dataset are free to vary.

• Selection intensity on these sites is rarely uniform, so it is desirable to model site-by-site rate variation.

• This is done in two ways:

– site specific (codon position, or alpha helix etc.)

– using a discrete approximation to a continuous distribution (gamma distribution).

• Again, these variables are modeled over all possibilities of sequence change over all possibilities of branch length over all possibilities of tree topology.

Page 13: Maximum likelihood (ML) method

The shape of the gamma distribution for different values of alpha.

Page 14: Maximum likelihood (ML) method

Incorporating gamma 1/2

• Alignment:

ACCT
GCCT

• Change probabilities (Jukes-Cantor, μ=0.1):

• Tree: ACCT – GCCT

• Nucleotide frequencies: p(a)=p(c)=p(g)=p(t)=0.25

• Two gamma classes, p(g1)=0.8, p(g2)=0.2

Page 15: Maximum likelihood (ML) method

Incorporating gamma 2/2

• P(ACCT, GCCT) =

(0.25*0.0062*0.8 + 0.25*0.0062*0.2) *
(0.25*0.9815*0.8 + 0.25*0.9815*0.2) *
(0.25*0.9815*0.8 + 0.25*0.9815*0.2) *
(0.25*0.9815*0.8 + 0.25*0.9815*0.2) = 0.00002289932

• Log10(P(ACCT, GCCT)) = -4.64 (the same value as without gamma, because this example applies the same change probabilities to both classes)

• Using gamma, more calculations are done, and more time is consumed
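In the worked example both classes share the same change probabilities, so the mixture collapses to the single-class result. In a typical rate-variation setup each class would instead scale the branch length by its own relative rate. The sketch below illustrates that under stated assumptions: it uses the standard JC69 formulas in terms of expected substitutions per site, a hypothetical branch length of 0.1, and hypothetical class rates of 0.5 and 3.0 (chosen so that the weighted mean rate is 1). It does not attempt to reproduce the 0.9815 and 0.0062 values from the slides, whose exact parameterization is not given.

```python
import math

def jc69_probs(distance):
    """JC69 probabilities for a branch of `distance` substitutions per site.

    Returns (probability of the same base, probability of one specific change).
    """
    decay = math.exp(-4.0 * distance / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay

def mixture_likelihood(seq1, seq2, distance, class_rates, class_weights,
                       freq=0.25):
    """Average each site's likelihood over rate classes, then multiply sites."""
    likelihood = 1.0
    for x, y in zip(seq1, seq2):
        site = 0.0
        for rate, weight in zip(class_rates, class_weights):
            p_same, p_change = jc69_probs(rate * distance)
            site += weight * freq * (p_same if x == y else p_change)
        likelihood *= site
    return likelihood

weights = (0.8, 0.2)  # class probabilities from the slide
single = mixture_likelihood("ACCT", "GCCT", 0.1, (1.0, 1.0), weights)
mixed = mixture_likelihood("ACCT", "GCCT", 0.1, (0.5, 3.0), weights)  # hypothetical rates
print(single, mixed)
```

The inner loop over classes is where the extra computation comes from: every site is evaluated once per rate class.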

Page 16: Maximum likelihood (ML) method

Selecting the correct model 1/4

• It was previously pointed out that parsimony can be inconsistent.

• ML can be inconsistent too!

• If the model used in the ML analysis is incorrect, the method might become inconsistent.

• Before analysis, the correct model should be selected.
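The transcript does not preserve which selection procedure the following slides showed. As one illustration only, candidate models are often compared with the Akaike information criterion, AIC = 2k - 2 lnL, where k is the number of free parameters and lnL is the maximized log-likelihood. In the sketch below the log-likelihood values are invented purely for demonstration, and the parameter counts cover the substitution model only (branch lengths are the same for every candidate and are omitted).

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood

# HYPOTHETICAL log-likelihoods, made up for this example.
# Parameter counts: JC69 none, K2P one ratio, HKY one ratio plus three
# free base frequencies, GTR five rates plus three free base frequencies.
candidates = {
    "JC69": (-1250.0, 0),
    "K2P": (-1235.0, 1),
    "HKY": (-1228.0, 4),
    "GTR": (-1226.5, 8),
}

scores = {name: aic(lnl, k) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

The point of the criterion is that a richer model always fits at least as well, so the gain in likelihood has to outweigh the penalty for the extra parameters.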

Page 17: Maximum likelihood (ML) method

Selecting the correct model 2/4

Page 18: Maximum likelihood (ML) method

Selecting the correct model 3/4

Page 19: Maximum likelihood (ML) method

Selecting the correct model 4/4

Page 20: Maximum likelihood (ML) method

Checking for saturation

Page 21: Maximum likelihood (ML) method

Likelihood mapping with TreePuzzle

Page 22: Maximum likelihood (ML) method

Practical issues

• There is an ML equivalent to the Wagner method for generating initial trees, but it is very slow.

• Many programs create an initial tree using parsimony or distance methods or use a completely random tree.

• Search strategy is similar to parsimony:

– 100 RAS (random addition sequences) + TBR (tree bisection and reconnection) for small datasets

– In addition, simulated annealing can be used for larger datasets