MP, ML, AML Reconstruction of Phylogenetic Trees: A Status ... › ~bchor › CG06 › ML.pdf ·...

Post on 05-Jul-2020

0 views 0 download

Transcript of MP, ML, AML Reconstruction of Phylogenetic Trees: A Status ... › ~bchor › CG06 › ML.pdf ·...

MP, ML, AML Reconstructionof Phylogenetic Trees:

A Status ReportBenny Chor

School of Computer ScienceTel-Aviv University

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.1

Phylogenetic Reconstruction• Input: A set of n aligned sequences (genes,proteins) from n species,

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.2

Phylogenetic Reconstruction• Input: A set of n aligned sequences (genes,proteins) from n species,

• Goal: Reconstruct the tree which best explainsthe evolutionary history of this gene/protein.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.2

Phylogenetic Reconstruction• Input: A set of n aligned sequences (genes,proteins) from n species,

• Goal: Reconstruct the tree which best explainsthe evolutionary history of this gene/protein.

• Tree reconstruction is still a challenge today.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.2

Phylogenetic Reconstruction• Input: A set of n aligned sequences (genes,proteins) from n species,

• Goal: Reconstruct the tree which best explainsthe evolutionary history of this gene/protein.

• Tree reconstruction is still a challenge today.• Many concrete questions are still unresolved (e.g.mammalian evolutionary tree).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.2

Phylogenetic Reconstruction• Input: A set of n aligned sequences (genes,proteins) from n species,

• Goal: Reconstruct the tree which best explainsthe evolutionary history of this gene/protein.

• Tree reconstruction is still a challenge today.• Many concrete questions are still unresolved (e.g.mammalian evolutionary tree).

• Most realistic formulations of the problem, whichtake errors into account, give rise to hardcomputational problems.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.2

Popular Methods• Distance based methods:

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:• Maximum Parsimony.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:• Maximum Parsimony.• Maximum Likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:• Maximum Parsimony.• Maximum Likelihood.

• Additional Methods:

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:• Maximum Parsimony.• Maximum Likelihood.

• Additional Methods:• Quartets Based.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Popular Methods• Distance based methods:

• UPGMA• Neighbor Joining.• Buneman trees.

• Character Based Methods:• Maximum Parsimony.• Maximum Likelihood.

• Additional Methods:• Quartets Based.• Disc Covering.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.3

Talk Outline• Maximum likelihood (ML).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.4

Talk Outline• Maximum likelihood (ML).• The likelihood surface.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.4

Talk Outline• Maximum likelihood (ML).• The likelihood surface.• Existence of multiple maxima.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.4

Talk Outline• Maximum likelihood (ML).• The likelihood surface.• Existence of multiple maxima.• Computation complexity: Maximum likelihoodvs. maximum parsimony (MP).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.4

Talk Outline• Maximum likelihood (ML).• The likelihood surface.• Existence of multiple maxima.• Computation complexity: Maximum likelihoodvs. maximum parsimony (MP).

• Ancestral maximum likelihood (AML) and itscomputational complexity.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.4

Maximum Likelihood• Input: A set of n observed sequences and anunderlying substitution model.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.5

Maximum Likelihood• Input: A set of n observed sequences and anunderlying substitution model.

• Desired Output: The weighted tree T thatmaximizes the likelihood of the data.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.5

Maximum Likelihood• Input: A set of n observed sequences and anunderlying substitution model.

• Desired Output: The weighted tree T thatmaximizes the likelihood of the data.

• Likelihood of a data: The conditional probabilityof producing the data, given the modelparameters.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.5

Maximum Likelihood• Input: A set of n observed sequences and anunderlying substitution model.

• Desired Output: The weighted tree T thatmaximizes the likelihood of the data.

• Likelihood of a data: The conditional probabilityof producing the data, given the modelparameters.

• Likelihood is a common optimization criteria innumerous settings, including phylogenetic(Felsenstein 1981).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.5

2–State Substitution Model

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

• Just two characters states, X and Y.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.6

2–State Substitution Model

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

• Just two characters states, X and Y.• Transitions between states are symmetric.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.6

2–State Substitution Model

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

• Just two characters states, X and Y.• Transitions between states are symmetric.• Equal rates across sites.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.6

2–State Substitution Model

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

• Just two characters states, X and Y.• Transitions between states are symmetric.• Equal rates across sites.• Every column induces a pattern.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.6

2–State Substitution Model

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

• Just two characters states, X and Y.• Transitions between states are symmetric.• Equal rates across sites.• Every column induces a pattern.• Remark: A simple model, yet very powerful.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.6

Neyman 2–State SubstitutionModel

e12

��

�e2

��

�e1

��

�e3

��

e123

1

2

3

4

For each edge e of a tree T , the edge weight pe rep-

resents the probability of having different states at the

two ends of e.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.7

A Very Simple ExampleFour species (n = 4), just one site (c = 1)

species observed data

1 X2 X3 Y4 Y

Analyze the natural tree (12)(34)

e12�

��e2

���

e1 ���e3

���e123

(1) X

(2) X

Y (3)

Y (4)

? ?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.8

Computing the LikelihoodEach unknown state (?) can assume one of twopossibilities, X or Y. For example, the assignment

p12�

��p2

���

p1 ���p3

���p123

(1) X

(2) X

Y (3)

Y (4)

X Y

contributes (1 − p1)·(1 − p2)·p12 ·(1 − p3)·(1 − p123).

The likelihood is the sum of this

+ three similar expressions. . .

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.9

Computing the Likelihood (2)

L( data | T, edge parameters)�

∑internal assignments

∏edges p

de(1 − p)�−de .

Each de is number of unequal sites along edge e. Itdepends on the internal assignment a, and inputpattern t at two ends of the edge.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.10

Computing the Likelihood (2)

L( data | T, edge parameters)�

∑internal assignments

∏edges p

de(1 − p)�−de .

Each de is number of unequal sites along edge e. Itdepends on the internal assignment a, and inputpattern t at two ends of the edge.

A well defined objective function to maximize.

Termed average likelihood by Penny and Steel.

Widely used in practice.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.10

Three Likelihood Versions• Big Likelihood: Given the sequence data, find atree and edge weights that maximizeL(data|tree & edge weights).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.11

Three Likelihood Versions• Big Likelihood: Given the sequence data, find atree and edge weights that maximizeL(data|tree & edge weights).

• Small Likelihood: Given observed data & a tree,but not the edge weights, find the edge weightsthat maximize the likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.11

Three Likelihood Versions• Big Likelihood: Given the sequence data, find atree and edge weights that maximizeL(data|tree & edge weights).

• Small Likelihood: Given observed data & a tree,but not the edge weights, find the edge weightsthat maximize the likelihood.

• Tiny Likelihood: Given observed data & a tree &edge weights, find the likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.11

Three Likelihood Versions• Big Likelihood: Given the sequence data, find atree and edge weights that maximizeL(data|tree & edge weights).

• Small Likelihood: Given observed data & a tree,but not the edge weights, find the edge weightsthat maximize the likelihood.

• Tiny Likelihood: Given observed data & a tree &edge weights, find the likelihood.

• Tiny likelihood can be efficiently computed usingdynamic programming (Felsenstein, 1981).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.11

Hill Climbing / Small Likelihood

• Typical approach to small likelihood, used inpractice:

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.12

Hill Climbing / Small Likelihood

• Typical approach to small likelihood, used inpractice:

• Start at some initial point with edge weights p.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.12

Hill Climbing / Small Likelihood

• Typical approach to small likelihood, used inpractice:

• Start at some initial point with edge weights p.• Apply hill climbing to the likelihood function, tillreaching a maximum.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.12

The Likelihood Surface• For hill climbing to be guaranteed to find themaximum, there must be a single local andglobal maximum in the parameter space.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.13

The Likelihood Surface• For hill climbing to be guaranteed to find themaximum, there must be a single local andglobal maximum in the parameter space.

• Fukami and Tateno (89), Tillier (94): For anytree, the ML point will be unique.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.13

The Likelihood Surface• For hill climbing to be guaranteed to find themaximum, there must be a single local andglobal maximum in the parameter space.

• Fukami and Tateno (89), Tillier (94): For anytree, the ML point will be unique.

• Steel (94): Proofs are erroneous - A simple butpathological counter example (multiple maximaon the wrong tree).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.13

The Likelihood Surface• For hill climbing to be guaranteed to find themaximum, there must be a single local andglobal maximum in the parameter space.

• Fukami and Tateno (89), Tillier (94): For anytree, the ML point will be unique.

• Steel (94): Proofs are erroneous - A simple butpathological counter example (multiple maximaon the wrong tree).

• (94–present): Hill climbing techniques still used.Steel’s counter example is considered too“biologically unrealistic” to warrant concern.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.13

The Likelihood Surface (cont.)• Rogers and Swofford (99): Simulation Study

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.14

The Likelihood Surface (cont.)• Rogers and Swofford (99): Simulation Study

• Data is simulated on a tree.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.14

The Likelihood Surface (cont.)• Rogers and Swofford (99): Simulation Study

• Data is simulated on a tree.• Multiple optima are rare...

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.14

The Likelihood Surface (cont.)• Rogers and Swofford (99): Simulation Study

• Data is simulated on a tree.• Multiple optima are rare...• ...especially on the correct tree.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.14

The Likelihood Surface (cont.)• Rogers and Swofford (99): Simulation Study

• Data is simulated on a tree.• Multiple optima are rare...• ...especially on the correct tree.

• Goal here: Investigate the problem analytically(joint work with Hendy, Holland, Penny).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.14

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).• Splits and sequence spectra (change of variables)

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).• Splits and sequence spectra (change of variables)• Constrained optimization.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).• Splits and sequence spectra (change of variables)• Constrained optimization.• Systems of polynomial equations.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).• Splits and sequence spectra (change of variables)• Constrained optimization.• Systems of polynomial equations.• Analytical solution: very hard in general, evenfor four taxa.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Maximizing Likelihood on TreesTools used

• Hadamard conjugation (Hendy and Penny 93).• Splits and sequence spectra (change of variables)• Constrained optimization.• Systems of polynomial equations.• Analytical solution: very hard in general, evenfor four taxa.

• Employing computer algebra and algebraicgeometry tools.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.15

Example: Conservative Data,Two Very Different ML Trees

species observed data

1 XXXXXXXYYY XXY XY YX XY X2 XXXXXXXYYY YYX YX YX YX X3 XXXXXXXYYY YYX XY XY XY X4 XXXXXXXYYY YYX XY XY YX Y

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.16

Example: Conservative Data,Two Very Different ML Trees

� �

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.16

Small Likelihood & MultipleMaxima

• Small Likelihood (reminder): Given observeddata & a tree, but not the edge weights, find theedge weights that maximize the likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.17

Small Likelihood & MultipleMaxima

• Small Likelihood (reminder): Given observeddata & a tree, but not the edge weights, find theedge weights that maximize the likelihood.

• Multiple ML points for general case imply smalllikelihood cannot be solved by hill climbing.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.17

Small Likelihood & MultipleMaxima

• Small Likelihood (reminder): Given observeddata & a tree, but not the edge weights, find theedge weights that maximize the likelihood.

• Multiple ML points for general case imply smalllikelihood cannot be solved by hill climbing.

• Not clear if small likelihood has efficient (worstcase) solutions.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.17

Maximum Parsimony (MP)• Big Parsimony: Given the sequence data, find atree and assignment of sequences to internalnodes that minimizes the number of changesacross all edges.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.18

Maximum Parsimony (MP)• Big Parsimony: Given the sequence data, find atree and assignment of sequences to internalnodes that minimizes the number of changesacross all edges.

• Small Parsimony: Given the sequence data and atree, find internal assignment(s) that minimizestotal number of changes.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.18

Maximum Parsimony (MP)• Big Parsimony: Given the sequence data, find atree and assignment of sequences to internalnodes that minimizes the number of changesacross all edges.

• Small Parsimony: Given the sequence data and atree, find internal assignment(s) that minimizestotal number of changes.

• MP considered by practitioners easier than ML.Indeed small parsimony has efficient algorithms(Fitch 1971, Sankoff and Cedergren 1983).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.18

Complexity of ReconstructionBoth MP and ML have well-defined objectivefunctions=⇒ Reconstruction is a computational problem.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.19

Complexity of ReconstructionBoth MP and ML have well-defined objectivefunctions=⇒ Reconstruction is a computational problem.

Number of trees over n leaves is exponential in n=⇒ Cannot exhaustively search all trees.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.19

Complexity: Small MP vs. ML• Small parsimony is in P.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.20

Complexity: Small MP vs. ML• Small parsimony is in P.• Small likelihood – unknown.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.20

Complexity: Big MP vs. MLIs ML Computationally Intractable?

Big MP known for almost 20 years to becomputationally intractable [Day et al., 1986,reduction from vertex cover] .

No such result has been found for Big ML to date(2004).

Tuffley and Steel (1997): Relations betweenlikelihood and parsimony.

Addario-Berry et al. (2003): Big Ancestral ML ishard.

Still, no cigar (and not even close).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.21

Ancestral ML (AML)• A tree reconstruction method that is “in between”ML and MP.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.22

Ancestral ML (AML)• A tree reconstruction method that is “in between”ML and MP.

• The goal is to simultaneously find edge weightsand assignment of sequences to internal nodes sothat the likelihood of the data, given the treeparameters, is maximized.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.22

Ancestral ML (AML)• A tree reconstruction method that is “in between”ML and MP.

• The goal is to simultaneously find edge weightsand assignment of sequences to internal nodes sothat the likelihood of the data, given the treeparameters, is maximized.

• AML is widely used in evolutionary studies.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.22

Ancestral ML (AML)• A tree reconstruction method that is “in between”ML and MP.

• The goal is to simultaneously find edge weightsand assignment of sequences to internal nodes sothat the likelihood of the data, given the treeparameters, is maximized.

• AML is widely used in evolutionary studies.• Also termed joint reconstruction of ancestralsequences.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.22

Ancestral ML (AML)• A tree reconstruction method that is “in between”ML and MP.

• The goal is to simultaneously find edge weightsand assignment of sequences to internal nodes sothat the likelihood of the data, given the treeparameters, is maximized.

• AML is widely used in evolutionary studies.• Also termed joint reconstruction of ancestralsequences.

• AML computes the likelihood contributionresulting from best assignment to internal nodes,while “regular ML” sums up over all assignments.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.22

Two AML Versions• Big AML: Given the sequence data, find a tree,assignment to internal nodes, and edge weightsthat maximize the likelihood of the data.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.23

Two AML Versions• Big AML: Given the sequence data, find a tree,assignment to internal nodes, and edge weightsthat maximize the likelihood of the data.

• Small AML: Given observed data, a tree andedge weights, but not the internal assignment,find the assignment that maximize the likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.23

Two AML Versions• Big AML: Given the sequence data, find a tree,assignment to internal nodes, and edge weightsthat maximize the likelihood of the data.

• Small AML: Given observed data, a tree andedge weights, but not the internal assignment,find the assignment that maximize the likelihood.

• PPSG 2000: A poly time, dynamic programmingalgorithm for small AML.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.23

Two AML Versions• Big AML: Given the sequence data, find a tree,assignment to internal nodes, and edge weightsthat maximize the likelihood of the data.

• Small AML: Given observed data, a tree andedge weights, but not the internal assignment,find the assignment that maximize the likelihood.

• PPSG 2000: A poly time, dynamic programmingalgorithm for small AML.

• Remark: Version where tree is given but no edgeweights or assignment is still open.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.23

Two AML Versions• Big AML: Given the sequence data, find a tree,assignment to internal nodes, and edge weightsthat maximize the likelihood of the data.

• Small AML: Given observed data, a tree andedge weights, but not the internal assignment,find the assignment that maximize the likelihood.

• PPSG 2000: A poly time, dynamic programmingalgorithm for small AML.

• Remark: Version where tree is given but no edgeweights or assignment is still open.

• ACHLPW 2003: Big AML is NP-hard.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.23

Useful AML Observation• Given sequence data, a tree, and assignment tointernal nodes.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.24

Useful AML Observation• Given sequence data, a tree, and assignment tointernal nodes.

• The edge weights that maximize the likelihood ofthe data equal de/k.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.24

Useful AML Observation• Given sequence data, a tree, and assignment tointernal nodes.

• The edge weights that maximize the likelihood ofthe data equal de/k.

• Where de equals the number of changes accrossedge e, and k is the common sequence length.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.24

AML, ReformulatedPrevious observation implies

• Input: A set S of n binary sequences, each oflength

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.25

AML, ReformulatedPrevious observation implies

• Input: A set S of n binary sequences, each oflength

• Goal: Find a tree T with n leaves, an assignmentp : E(T ) → [0, 1] of edge probabilities, and alabelling λ : V (T ) → {0, 1}k of the vertices suchthat

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.25

AML, ReformulatedPrevious observation implies

• Input: A set S of n binary sequences, each oflength

• Goal: Find a tree T with n leaves, an assignmentp : E(T ) → [0, 1] of edge probabilities, and alabelling λ : V (T ) → {0, 1}k of the vertices suchthat1. The n labels of the leaves are exactly thesequences from S.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.25

AML, ReformulatedPrevious observation implies

• Input: A set S of n binary sequences, each oflength

• Goal: Find a tree T with n leaves, an assignmentp : E(T ) → [0, 1] of edge probabilities, and alabelling λ : V (T ) → {0, 1}k of the vertices suchthat1. The n labels of the leaves are exactly thesequences from S.

2. the sum of all “edge entropies”∑e∈E(T ) H (de/k) is minimized.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.25

AML, ReformulatedPrevious observation implies

• Input: A set S of n binary sequences, each oflength

• Goal: Find a tree T with n leaves, an assignmentp : E(T ) → [0, 1] of edge probabilities, and alabelling λ : V (T ) → {0, 1}k of the vertices suchthat1. The n labels of the leaves are exactly thesequences from S.

2. the sum of all “edge entropies”∑e∈E(T ) H (de/k) is minimized.

• H(p) = −p log(p) − (1 − p) log(1 − p) is thebinary entropy function.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.25

AML vs. MPOptimization criteria

• Input: A set S of n binary sequences, each oflength k.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.26

AML vs. MPOptimization criteria

• Input: A set S of n binary sequences, each oflength k.

• AML: Minimize the sum of all “edge entropies”∑e∈E(T ) H (de/k).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.26

AML vs. MPOptimization criteria

• Input: A set S of n binary sequences, each oflength k.

• AML: Minimize the sum of all “edge entropies”∑e∈E(T ) H (de/k).

• MP: Minimize the sum of all “edge differences”∑e∈E(T ) de/k.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.26

AML vs. MPOptimization criteria

• Input: A set S of n binary sequences, each oflength k.

• AML: Minimize the sum of all “edge entropies”∑e∈E(T ) H (de/k).

• MP: Minimize the sum of all “edge differences”∑e∈E(T ) de/k.

• Can think of the two problems as attempting tominimize different edge weights (functions of de).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.26

NP hardness of AML: Ideas• MP was shown NP-hard by Day, Johnson,Sankoff using reduction from vertex cover (VC).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.27

NP hardness of AML: Ideas• MP was shown NP-hard by Day, Johnson,Sankoff using reduction from vertex cover (VC).

• Analogy of AML and MP optimization criteriasuggests using similar approach.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.27

NP hardness of AML: Ideas• MP was shown NP-hard by Day, Johnson,Sankoff using reduction from vertex cover (VC).

• Analogy of AML and MP optimization criteriasuggests using similar approach.

• Reduction from VC indeed identical.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.27

NP hardness of AML: Ideas• MP was shown NP-hard by Day, Johnson,Sankoff using reduction from vertex cover (VC).

• Analogy of AML and MP optimization criteriasuggests using similar approach.

• Reduction from VC indeed identical.• Proof substantially more involved as entropy

H (de/k) is not as “well behaved” as plain edgedifferences de/k.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.27

Open Problems as of 2004• Hardness proof for big AML as a stepping stonefor big ML?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.28

Open Problems as of 2004• Hardness proof for big AML as a stepping stonefor big ML?

• Is small ML in poly-time?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.28

Our Major QuestionIs ML Computationally Intractable?

MP known for almost 20 years to becomputationally intractable [Day et al., 1986,translation from vertex cover] .

No such result has been found for ML to date.

Tuffley and Steel (1997): Relations betweenlikelihood and parsimony.

Addario-Berry et al. (2003): Ancestral ML ishard.

Still, no cigar (and not even close).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.29

Is ML Computationally In-tractable?

Still, no cigar (and not even close).

Particularly frustrating in light of intuition amongpractitioners that ML is harder than MP.

Maybe some slick and efficient ML algorithmlurks out there, waiting to be discovered?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.30

Is ML Computationally In-tractable?

Still, no cigar (and not even close).

Particularly frustrating in light of intuition amongpractitioners that ML is harder than MP.

Maybe some slick and efficient ML algorithmlurks out there, waiting to be discovered?

CT2005:ML is computationally hard (NP complete)=⇒ No such algorithm exists (unless P=NP).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.30

Intractability Proof: The BigPictureEfficiently translate vertex cover (VC) to ML.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.31

Intractability Proof: The BigPictureEfficiently translate vertex cover (VC) to ML.

=⇒

k�0�

110..0�0110..0�

000110..0� 00110..0�

0000110..0�

00000110..0�

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.31

Intractability Proof: The BigPictureEfficiently translate vertex cover (VC) to ML.

=⇒

k�0�

110..0�0110..0�

000110..0� 00110..0�

0000110..0�

00000110..0�

“Translation” means

Small cover =⇒ Large likelihood.

Large cover =⇒ Small likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.31

Vertex Cover in GraphsGiven a graph (V,E)

find a small set of vertices C

such that for each edge in the graph,

C contains at least one endpoint.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.32

Vertex Cover in GraphsGiven a graph (V,E)

find a small set of vertices C

such that for each edge in the graph,

C contains at least one endpoint.

(figure from www.cc.ioc.ee/jus/gtglossary/assets/vertex_cover.gif)

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.32

Well Known: Vertex Cover isIntractableThe decision version of this problem (does G has acover of size ≤ c) is computationally intractable.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.33

Well Known: Vertex Cover isIntractableThe decision version of this problem (does G has acover of size ≤ c) is computationally intractable.

(figure from http://wwwbrauer.in.tum.de/gruppen/theorie/hard/vc1.png)MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.33

Maximum Likelihood: DecisionProblem

Input: A set of equi length binary sequencesS1, S2, . . . , Sm, and a real number, D.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.34

Maximum Likelihood: DecisionProblem

Input: A set of equi length binary sequencesS1, S2, . . . , Sm, and a real number, D.

Question: Is there a tree T and edge lengths such thatlog2L(S1, S2, . . . , Sm | T, edge parameters) > D?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.34

Maximum Likelihood: DecisionProblem

Input: A set of equi length binary sequencesS1, S2, . . . , Sm, and a real number, D.

Question: Is there a tree T and edge lengths such thatlog2L(S1, S2, . . . , Sm | T, edge parameters) > D?

Notice: Yes/No question.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.34

Translating VC to MLThe following graph, with 5 vertices and 6 edges

1 3 2

4 5

Translates to

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.35

Translating VC to MLThe following graph, with 5 vertices and 6 edges

1 3 2

4 5

Translates to

A set with 7 binary sequences, each of length 5:

00000 10100 10010 01100

01001 00110 00101

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.35

Translating VC to MLIf G has n vertices andm edges, will constructm + 1 binary sequences, each of length n.

One sequence is all zeroes.

For every edge (i, j) ∈ E, have the sequencei−1︷ ︸︸ ︷

00..00 1

j−i−1︷ ︸︸ ︷00..00 1

n−j︷ ︸︸ ︷00..00︸ ︷︷ ︸

n

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.36

Relation between likelihood andparsimony (Tuffley and Steel)

L(S|T ) ≡ Pr(S|p∗, T ) ≥ 2− log(kc)·pars(S,T )−Cd

L(S|T ) ≡ Pr(S|p∗, T ) ≤ 2− log(kc)·pars(S,T )−Cu

Cu and Cd are sub quadratic functions of the size of|V (T )|, pars(S, T ), and k − kc.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.37

Relation between likelihood andparsimony (Tuffley and Steel)

L(S|T ) ≡ Pr(S|p∗, T ) ≥ 2− log(kc)·pars(S,T )−Cd

L(S|T ) ≡ Pr(S|p∗, T ) ≤ 2− log(kc)·pars(S,T )−Cu

Cu and Cd are sub quadratic functions of the size of|V (T )|, pars(S, T ), and k − kc.

Conclusion: if (k − kc) = O(|V (T )|) (as our case)then

L(S|T ) = O(pars(S, T ) · log(n)) + O(|V (T )|2)+O(pars(S, T ) log(pars(S, T ))

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.37

Canonical Trees: Definition1. Tree has an internal node (called the “root” ) with 0 length

edge to the all zero leaf.

2. All leaves are at distance one or two from the root.

3. Subtrees of distance two leaves contains one, two, or three

leaves. All sequences in a subtree with two or three leaves

share a “1” in same position.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.38

Canonical Trees: Definition1. Tree has an internal node (called the “root” ) with 0 length

edge to the all zero leaf.

2. All leaves are at distance one or two from the root.

3. Subtrees of distance two leaves contains one, two, or three

leaves. All sequences in a subtree with two or three leaves

share a “1” in same position.

k�0�

11000000..000�01100000..000�

00000110..000� 00001100..000�

10001000..000�

10000100..000�

00000000..011�

00000010..001�

00000011..000�

00000010..010�

k�0�

Root�

Yes NoMP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.38

Canonical Trees and VertexCovers[Day, 1986]: A canonical tree with degree d at rootexists

⇐⇒G has a cover of size d.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.39

Canonical Trees and Likelihood

Now establish relationship between degree of root ofcanonical trees and likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.40

Canonical Trees and Likelihood

Now establish relationship between degree of root ofcanonical trees and likelihood.

Given a canonical tree T withm + 1 leaves labelledby sequences from S. Let d denote the degree of theroot. Then for the optimal edge lengths p∗,

log(Pr(S |T, p∗)) = −(m + d) · log n + θ(n) .

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.40

Canonical Trees and Likelihood

Now establish relationship between degree of root ofcanonical trees and likelihood.

Given a canonical tree T withm + 1 leaves labelledby sequences from S. Let d denote the degree of theroot. Then for the optimal edge lengths p∗,

log(Pr(S |T, p∗)) = −(m + d) · log n + θ(n) .

So as n → ∞,− log(L)

(m + d) log(n)→ 1 .

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.40

Do We Have the Desired Proof?− log(L)

(m + d) log(n)→n→∞ 1 .

Seems to imply

Small cover =⇒ Large likelihood.

Large cover =⇒ Small likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.41

Do We Have the Desired Proof?− log(L)

(m + d) log(n)→n→∞ 1 .

Seems to imply

Small cover =⇒ Large likelihood.

Large cover =⇒ Small likelihood.

Take it easy. There is a problem here. What weactually showed is a reduction from VC to ML ofcanonical trees.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.41

Do We Have the Desired Proof?− log(L)

(m + d) log(n)→n→∞ 1 .

Seems to imply

Small cover =⇒ Large likelihood.

Large cover =⇒ Small likelihood.

Take it easy. There is a problem here. What weactually showed is a reduction from VC to ML ofcanonical trees.

But ML tree need not be canonical!

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.41

All Is Lost?

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

Carry out a sequence of gentle modification.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

Carry out a sequence of gentle modification.

Each modification may decrease log likelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

Carry out a sequence of gentle modification.

Each modification may decrease log likelihood.

But only by a little (B log n).

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

Carry out a sequence of gentle modification.

Each modification may decrease log likelihood.

But only by a little (B log n).

Number of modification small enough.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

All Is Lost?

Not necessarily.

Starting from any ML tree (arbitrary shape).

Carry out a sequence of gentle modification.

Each modification may decrease log likelihood.

But only by a little (B log n).

Number of modification small enough.

So accumulated loss in log likelihood is small.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.42

Hardness ConclusionMaximum likelihood of phylogenetic tree iscomputationally intractable

No magic bullet!

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.43

Open Problems and Further Re-search

Four states characters & beyond.√

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

ML hardness under molecular clock.√

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

ML hardness under molecular clock.√

Efficient approximation algorithms.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

ML hardness under molecular clock.√

Efficient approximation algorithms.

Efficient algorithms for small likelihood: Givenobserved data & a tree, but not the edge weights,find the edge weights that maximize thelikelihood.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

ML hardness under molecular clock.√

Efficient approximation algorithms.

Efficient algorithms for small likelihood: Givenobserved data & a tree, but not the edge weights,find the edge weights that maximize thelikelihood.

Regions of input parameters where ML can beefficiently solved.

MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44

Open Problems and Further Re-search

Four states characters & beyond.√

Hardness of ML approximation.√

ML hardness under molecular clock.√

Efficient approximation algorithms.

Efficient algorithms for small likelihood: Givenobserved data & a tree, but not the edge weights,find the edge weights that maximize thelikelihood.

Regions of input parameters where ML can beefficiently solved.

♣MP, ML, AML Reconstructionof Phylogenetic Trees:A Status Report – p.44