Using algebraic geometry for phylogenetic reconstruction...Title Using algebraic geometry for...

Post on 03-Aug-2021

1 views 0 download

Transcript of Using algebraic geometry for phylogenetic reconstruction...Title Using algebraic geometry for...

Using algebraic geometry for phylogeneticreconstruction

Marta Casanellas i Rius(joint work with Jesus Fernandez-Sanchez)

Departament de Matematica Aplicada IUniversitat Politecnica de Catalunya

IMA WorkshopApplications in Biology, Dynamics, and Statistics

March 8, 2007

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 1 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 2 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 2 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 2 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 2 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 3 / 36

Phylogenetic reconstruction

Given n current species, e.g. HUMAN, GORILLA, CHIMP,given part of their genome and an alignment alignment

Goal: to reconstruct their ancestral relationships (phylogeny):

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 4 / 36

Phylogenetic reconstruction

Given n current species, e.g. HUMAN, GORILLA, CHIMP,given part of their genome and an alignment alignment

Goal: to reconstruct their ancestral relationships (phylogeny):

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 4 / 36

AlignmentMutations, deletions and insertions of nucleotides occur along thespeciation process.

Given two DNA sequences, an alignment is a correspondencebetween them that accounts for their differences. The optimalalignment is the one that minimizes the number of mutations,deletions and insertions.seq1 : ACGTAGCTAAGTTA... seq2 : ACCGAGACCCAGTA...

A possible alignment is:

seq1 A C − G − T A − G C T A A G T T Aseq2 A C C G A G A C− C C A − G T − A

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 5 / 36

AlignmentMutations, deletions and insertions of nucleotides occur along thespeciation process.

Given two DNA sequences, an alignment is a correspondencebetween them that accounts for their differences. The optimalalignment is the one that minimizes the number of mutations,deletions and insertions.seq1 : ACGTAGCTAAGTTA... seq2 : ACCGAGACCCAGTA...

A possible alignment is:

seq1 A C − G − T A − G C T A A G T T Aseq2 A C C G A G A C− C C A − G T − A

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 5 / 36

Phylogenetic reconstruction

Given n current species, e.g. HUMAN, GORILLA, CHIMP,given part of their genome and an alignment alignment

AACTTCGAGGCTTACCGCTG

AAGGTCGATGCTCACCGATG

AACGTCTATGCTCACCGATG

Goal: to reconstruct their ancestral relationships (phylogeny):

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 6 / 36

Phylogenetic reconstruction

Given n current species, e.g. HUMAN, GORILLA, CHIMP,given part of their genome and an alignment alignment

AACTTCGAGGCTTACCGCTG

AAGGTCGATGCTCACCGATG

AACGTCTATGCTCACCGATG

Goal: to reconstruct their ancestral relationships (phylogeny):

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 6 / 36

Algebraic evolutionary models

Assume that all sites of the alignment evolve equally andindependently.At each node of the tree we put a random variable taking values in{A,C,G,T }

ti = branch length(represents the number ofmutations per site along thatbranch)

Variables at the leaves are observed and variables at the interiornodes are hidden.At each branch we write a matrix (substitution matrix) with theprobabilities of a nucleotide at the parent node being substitutedby another at its child.

A C G T

S =

ACGT

P(A|A) P(C|A) P(G|A) P(T |A)P(A|C) P(C|C) P(G|C) P(T |C)P(A|G) P(C|G) P(G|G) P(T |G)P(A|T ) P(C|T ) P(G|T ) P(T |T )

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 7 / 36

Algebraic evolutionary models

Assume that all sites of the alignment evolve equally andindependently.At each node of the tree we put a random variable taking values in{A,C,G,T }

ti = branch length(represents the number ofmutations per site along thatbranch)

Variables at the leaves are observed and variables at the interiornodes are hidden.At each branch we write a matrix (substitution matrix) with theprobabilities of a nucleotide at the parent node being substitutedby another at its child.

A C G T

S =

ACGT

P(A|A) P(C|A) P(G|A) P(T |A)P(A|C) P(C|C) P(G|C) P(T |C)P(A|G) P(C|G) P(G|G) P(T |G)P(A|T ) P(C|T ) P(G|T ) P(T |T )

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 7 / 36

The entries of Si are unknown parameters.

Example (Group-based models. Kimura 3-parameters)Root node has uniform distribution: πA = ... = πT = 0.25

A C G T

Si =ACGT

(ai bi ci dibi ai di cici di ai bidi ci bi ai

), ai + bi + ci + di = 1

Kimura 2-parameters: bi = di .

Jukes-Cantor: bi = ci = di .

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 8 / 36

The entries of Si are unknown parameters.

Example (Group-based models. Kimura 3-parameters)Root node has uniform distribution: πA = ... = πT = 0.25

A C G T

Si =ACGT

(ai bi ci dibi ai di cici di ai bidi ci bi ai

), ai + bi + ci + di = 1

Kimura 2-parameters: bi = di .

Jukes-Cantor: bi = ci = di .

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 8 / 36

The entries of Si are unknown parameters.

Example (Group-based models. Kimura 3-parameters)Root node has uniform distribution: πA = ... = πT = 0.25

A C G T

Si =ACGT

(ai bi ci dibi ai di cici di ai bidi ci bi ai

), ai + bi + ci + di = 1

Kimura 2-parameters: bi = di .

Jukes-Cantor: bi = ci = di .

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 8 / 36

Hidden Markov processWe denote the joint distribution of the observed variables X1, X2, X3 aspx1x2x3 = Prob(X1 = x1, X2 = x2, X3 = x3).

px1x2x3 =∑

y4,yr∈{A,C,G,T}

πyr S1(x1, yr )S4(y4, yr )S2(x2, y4)S3(x3, y4)

px1x2x3 is a homogeneous polynomial on the parameters whosedegree is the number of edges.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 9 / 36

The evolutionary model defines a polynomial map

ϕ : Rd −→ R4n

θ = (θ1, . . . , θd) 7→ (pAA...A, pAA...C, pAA...G, . . . , pTT...T)

ϕ : 4d−1 −→ 44n−1

ϕ : Cd −→ C4n

Algebraic variety V = imϕ, closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1,x2,x3

as the relative frequency of column x1x2x3 in the alignment.

In the theoretical model, this would be a point on the variety V .

Goal: use the ideal of V , I(V ) ⊂ R[pAA...A, pAA...C, . . . , pTT...T], toinfer the topology of the phylogenetic tree.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 10 / 36

The evolutionary model defines a polynomial map

ϕ : Rd −→ R4n

θ = (θ1, . . . , θd) 7→ (pAA...A, pAA...C, pAA...G, . . . , pTT...T)

ϕ : 4d−1 −→ 44n−1

ϕ : Cd −→ C4n

Algebraic variety V = imϕ, closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1,x2,x3

as the relative frequency of column x1x2x3 in the alignment.

In the theoretical model, this would be a point on the variety V .

Goal: use the ideal of V , I(V ) ⊂ R[pAA...A, pAA...C, . . . , pTT...T], toinfer the topology of the phylogenetic tree.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 10 / 36

The evolutionary model defines a polynomial map

ϕ : Rd −→ R4n

θ = (θ1, . . . , θd) 7→ (pAA...A, pAA...C, pAA...G, . . . , pTT...T)

ϕ : 4d−1 −→ 44n−1

ϕ : Cd −→ C4n

Algebraic variety V = imϕ, closure in Zariski topology.

Given an alignment, we can estimate the joint probability px1,x2,x3

as the relative frequency of column x1x2x3 in the alignment.

In the theoretical model, this would be a point on the variety V .

Goal: use the ideal of V , I(V ) ⊂ R[pAA...A, pAA...C, . . . , pTT...T], toinfer the topology of the phylogenetic tree.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 10 / 36

Computing the ideal(Eriksson–Ranestad–Sturmfels–Sullivant ’05)

Some generators of I(V ) depend only on the model chosen (noton the topology). E.g. for Jukes-Cantor they are:

∑px1x2x3 = 1

pAAA = pCCC = pGGG = pTTT 4 terms

pAAC = pAAG = pAAT = · · · = pTTG 12 terms

pACA = pAGA = pATA = · · · = pTGT 12 terms

pCAA = pGAA = pTAA = · · · = pGTT 12 terms

pACG = pACT = pAGT = · · · = pCGT 24 termsFor this model, the unique polynomial that detects the phylogenyof the three species has degree 3.

Definition (Cavender–Felsenstein ’87)The generators of I(V ) are called phylogenetic invariants.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 11 / 36

Computing the ideal(Eriksson–Ranestad–Sturmfels–Sullivant ’05)

Some generators of I(V ) depend only on the model chosen (noton the topology). E.g. for Jukes-Cantor they are:

∑px1x2x3 = 1

pAAA = pCCC = pGGG = pTTT 4 terms

pAAC = pAAG = pAAT = · · · = pTTG 12 terms

pACA = pAGA = pATA = · · · = pTGT 12 terms

pCAA = pGAA = pTAA = · · · = pGTT 12 terms

pACG = pACT = pAGT = · · · = pCGT 24 termsFor this model, the unique polynomial that detects the phylogenyof the three species has degree 3.

Definition (Cavender–Felsenstein ’87)The generators of I(V ) are called phylogenetic invariants.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 11 / 36

Computing the ideal(Eriksson–Ranestad–Sturmfels–Sullivant ’05)

Some generators of I(V ) depend only on the model chosen (noton the topology). E.g. for Jukes-Cantor they are:

∑px1x2x3 = 1

pAAA = pCCC = pGGG = pTTT 4 terms

pAAC = pAAG = pAAT = · · · = pTTG 12 terms

pACA = pAGA = pATA = · · · = pTGT 12 terms

pCAA = pGAA = pTAA = · · · = pGTT 12 terms

pACG = pACT = pAGT = · · · = pCGT 24 termsFor this model, the unique polynomial that detects the phylogenyof the three species has degree 3.

Definition (Cavender–Felsenstein ’87)The generators of I(V ) are called phylogenetic invariants.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 11 / 36

Problem: computation of invariants

Computational algebra software fail to compute the ideal for ≥ 4species! Kimura 3-parameter, 4 species, 8002 generators like:

[Small trees webpage]

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 12 / 36

Problem: computation of invariants

Computational algebra software fail to compute the ideal for ≥ 4species! Kimura 3-parameter, 4 species, 8002 generators like:

[Small trees webpage]

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 12 / 36

Group-based models. Discrete Fourier transform

Group-based models (Kimura and Jukes-Cantor): G = Z2 × Z2,A=(0,0),C=(0,1),G=(1,0),T=(1,1) . Substitution matricesare functions on the group:

Si(gp, gc) = f i(gp − gc), gc , gp ∈ G.

pg1,...,gm = p(g1, . . . , gm) =14

∑ ∏e∈edges

f e(gp(e) − gc(e))

sum over all possible values at interior nodes.

Discrete Fourier transform f : G −→ C

f (χ) =∑g∈G

χ(g)f (g), χ ∈ Hom(G, C∗) ∼= G

Convolution:(f1 ∗ f2)(g) =∑h∈G

f1(h)f2(g − h) =⇒ f1 ∗ f2 = f1 · f2

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 13 / 36

Group based models.

Theorem (Evans–Speed)For a group-based model (i.e. Kimura 3, Kimura 2 or Jukes-Cantor) ona tree T, the discrete Fourier transform of the joint distributionp(g1, . . . , gn) has the following form

q(χ1, . . . , χm) =∏

e∈edges

f e( ∏

l∈leaves below e

χl)

Using a linear change of coordinates, one gets a monomialparameterization of the variety associated to the model

ϕ : Cd −→ C4n

θ = (θ1, . . . , θd) 7→ (qAA...A, qAA...C, qAA...G, . . . , qTT...T)

so that it is a toric variety.

The ideal is generated by binomials.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 14 / 36

Fourier transform for Kimura 3-parameter model

Parameters Fourier parameters

Se =

ae be ce de

be ae de ce

ce de ae be

de ce be ae

Pe =

Pe

A 0 0 00 Pe

C 0 00 0 Pe

G 00 0 0 Pe

T

Pe

A=ae+be+ce+de, PeC=ae−be+ce−de...

simplex: 43 ∆3

ae+be+ce+de=1 PeA=1

coordinates: px1...xn qg1...gn , g1+···+gn=0

simplex: 44n−1 ∆4n−1−1∑px1...xn = 1 qA...A = 1

Linear invariants: qg1...gn = 0 if g1 + · · ·+ gn 6= 0 in Z2 × Z2.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 15 / 36

For the Kimura 3-parameter model, we are interested inV := im(ϕ) where

ϕ :∏

e∈E(T )

C4 −→ C4n−1

(PeA,Pe

C ,PeG,Pe

T )e

7→ (qx1...xn){x1,...,xn|x1+···+xn=0}

But ∆4n−1−1 ⊂ {qA...A = 1}.

DefinitionThe Kimura variety of the phylogenetic tree T is W := V ∩ {qA...A = 1}.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 16 / 36

Recursive construction of invariants

Computational algebra software: even in Fourier parameterization,fail for ≥ 5 leaves.

Theorem (Sturmfels – Sullivant ’05)For any group-based model, they give an explicit algorithm forobtaining the generators of the ideal of phylogenetic invariants of ann-leaved tree from those of an unrooted tree of 3 leaves. Thegenerators are binomials of degree ≤ 4.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 17 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 18 / 36

Phylogenetic inference using algebraic geometry

1990-1995: Biologists claim that phylogenetic invariants are notuseful for phylogenetic reconstruction.Lake only used linear invariants!

In particular, a work of Huelsenbeck’95 reveals the inefficiency ofLake’s method of invariants.

(Eriksson ’05) for General Markov Model.

(C–Garcia–Sullivant ’05) for Kimura 3-parameter.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 19 / 36

Phylogenetic inference using algebraic geometry

1990-1995: Biologists claim that phylogenetic invariants are notuseful for phylogenetic reconstruction.Lake only used linear invariants!

In particular, a work of Huelsenbeck’95 reveals the inefficiency ofLake’s method of invariants.

(Eriksson ’05) for General Markov Model.

(C–Garcia–Sullivant ’05) for Kimura 3-parameter.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 19 / 36

Phylogenetic inference using algebraic geometry

Model: Unrooted tree of 4 leaves, Kimura 3-parameters.Naive algorithm: Take || · ||1 of the evaluation of all invariants onthe point given by the relative frequencies of the columns of thealignment. Obtain a score for each possible topology and choosethe one with smaller score.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 20 / 36

Phylogenetic inference using algebraic geometry

Model: Unrooted tree of 4 leaves, Kimura 3-parameters.Naive algorithm: Take || · ||1 of the evaluation of all invariants onthe point given by the relative frequencies of the columns of thealignment. Obtain a score for each possible topology and choosethe one with smaller score.

Huelsenbeck ’95: assesses the performance of different phylogeneticreconstruction methods on the following tree space.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 20 / 36

Results on simulated data (Huelsenbeck)

Huelsenbeck’s results:

Length 100 Length 500 Length 1000

Lake invariants

NJ

ML

Huelsenbeck, Syst. Biol., 1995M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 21 / 36

C–Fernandez-Sanchez studies on C–Garcia–Sullivantmethod

Length 100 Length 500 Length 1000

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 22 / 36

Why using phylogenetic invariants

1 We do not need to estimate parameters of the model.2 The algebraic model allows species within the tree to evolve at

different mutation ratesIn the non-algebraic version, one has: Si = exp(Q · ti) where Q is afixed matrix that represents the instantaneous mutation rate of allthe species in the tree.In the algebraic model, one is allowing different rates at eachbranch of the tree.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 23 / 36

For instance using algebraic geometry one should be able toreconstruct the biologically correct tree:

dog human chimp rat mouse chicken

(Al-Aidroos, Snir ’05) chapter 21 in ASCB book.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 24 / 36

For instance using algebraic geometry one should be able toreconstruct the biologically correct tree:

dog human chimp rat mouse chicken

(Al-Aidroos, Snir ’05) chapter 21 in ASCB book.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 24 / 36

For instance using algebraic geometry one should be able toreconstruct the biologically correct tree:

dog human chimp rat mouse chickendog human chimp rat mouse chicken

(Al-Aidroos, Snir ’05) chapter 21 in ASCB book.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 24 / 36

Simulations on non-homogeneous trees (different ratematrices)

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 25 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 26 / 36

The global geometry of the Kimura variety

Recall that V := im(ϕ) where

ϕ :∏

e∈E(T )

C4 −→ C4n−1

(PeA,Pe

C ,PeG,Pe

T )e

7→ (qx1...xn){x1,...,xn|x1+···+xn=0}

But ∆4n−1−1 ⊂ {qA...A = 1}.

DefinitionThe Kimura variety of the phylogenetic tree T is W := V ∩ {qA...A = 1}.

W = ϕ(∏

e∈E(T )(C4 ∩ {PeA = 1})).

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 27 / 36

Local complete intersection

V ⊂ AN algebraic variety, µ minimum number of generators ofI(V ), then

µ ≥ codim(V ) = N − dim(V )

.V is called a complete intersection if µ = codim(V ).

Example: the Kimura variety W ⊂ C4n−1−1 has codimension4n−1 − 6n + 8, n = number of leaves.For n = 4, the minimum number of generators of the Kimuravariety is 8002 whereas its codimension is 48!On a neighborhood of a smooth point, any variety is a localcomplete intersection (i.e. it can be defined by codim(V )equations).Goal:

1 Find the singular points.2 Provide a set of local generators at non-singular points.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 28 / 36

Local complete intersection

V ⊂ AN algebraic variety, µ minimum number of generators ofI(V ), then

µ ≥ codim(V ) = N − dim(V )

.V is called a complete intersection if µ = codim(V ).

Example: the Kimura variety W ⊂ C4n−1−1 has codimension4n−1 − 6n + 8, n = number of leaves.For n = 4, the minimum number of generators of the Kimuravariety is 8002 whereas its codimension is 48!On a neighborhood of a smooth point, any variety is a localcomplete intersection (i.e. it can be defined by codim(V )equations).Goal:

1 Find the singular points.2 Provide a set of local generators at non-singular points.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 28 / 36

Local complete intersection

V ⊂ AN algebraic variety, µ minimum number of generators ofI(V ), then

µ ≥ codim(V ) = N − dim(V )

.V is called a complete intersection if µ = codim(V ).

Example: the Kimura variety W ⊂ C4n−1−1 has codimension4n−1 − 6n + 8, n = number of leaves.For n = 4, the minimum number of generators of the Kimuravariety is 8002 whereas its codimension is 48!On a neighborhood of a smooth point, any variety is a localcomplete intersection (i.e. it can be defined by codim(V )equations).Goal:

1 Find the singular points.2 Provide a set of local generators at non-singular points.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 28 / 36

Case n = 3

The parameterization ϕ is:

ϕ : C9 −→ C15

((1,P1C ,P1

G,P1T ),(1,P2

C ,P2G,P2

T ),(1,P3C ,P3

G,P3T )) 7→ (qxyz=P1

x P2y P3

z ){x+y+z=0}

Let H = (Z2 × Z2, ∗) and let (ε, δ) ∈ H act on C9 sending(P1, P2, P3) to ((1,εP1

C ,δP1G,εδP1

T ),(1,εP2C ,δP2

G,εδP2T ),(1,εP3

C ,δP3G,εδP3

T )).

E.g. (ε, δ) = (1,−1), in probability parameters the group actionpermutes A ↔ C and G ↔ T .

Proposition

The Kimura variety W = im(ϕ) on T3 is the affine GIT quotient C9//H.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 29 / 36

Arbitrary n

T unrooted n-leaved tree(2n − 3 edges, n − 2 interior nodes)

Extending the action of H to the n − 2 interior nodes of T , we have

TheoremThe Kimura variety W is isomorphic to the affine GIT quotient

(C3)2n−3//Hn−2

CorollaryW = im(ϕ), no closure needed.

|ϕ−1(q)| ≤ 4n−2 and there is just one preimage with biologicalmeaning, ∀q ∈ W.

We are able to determine the singular points. In particular,biologically meaningful points q ∈ W+ := ϕ(

∏∆3

+) are notsingular.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 30 / 36

Arbitrary n

T unrooted n-leaved tree(2n − 3 edges, n − 2 interior nodes)

Extending the action of H to the n − 2 interior nodes of T , we have

TheoremThe Kimura variety W is isomorphic to the affine GIT quotient

(C3)2n−3//Hn−2

CorollaryW = im(ϕ), no closure needed.

|ϕ−1(q)| ≤ 4n−2 and there is just one preimage with biologicalmeaning, ∀q ∈ W.

We are able to determine the singular points. In particular,biologically meaningful points q ∈ W+ := ϕ(

∏∆3

+) are notsingular.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 30 / 36

Arbitrary n

T unrooted n-leaved tree(2n − 3 edges, n − 2 interior nodes)

Extending the action of H to the n − 2 interior nodes of T , we have

TheoremThe Kimura variety W is isomorphic to the affine GIT quotient

(C3)2n−3//Hn−2

CorollaryW = im(ϕ), no closure needed.

|ϕ−1(q)| ≤ 4n−2 and there is just one preimage with biologicalmeaning, ∀q ∈ W.

We are able to determine the singular points. In particular,biologically meaningful points q ∈ W+ := ϕ(

∏∆3

+) are notsingular.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 30 / 36

Arbitrary n

T unrooted n-leaved tree(2n − 3 edges, n − 2 interior nodes)

Extending the action of H to the n − 2 interior nodes of T , we have

TheoremThe Kimura variety W is isomorphic to the affine GIT quotient

(C3)2n−3//Hn−2

CorollaryW = im(ϕ), no closure needed.

|ϕ−1(q)| ≤ 4n−2 and there is just one preimage with biologicalmeaning, ∀q ∈ W.

We are able to determine the singular points. In particular,biologically meaningful points q ∈ W+ := ϕ(

∏∆3

+) are notsingular.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 30 / 36

In Fourier parameters:

∆3+ =

In probability parameters this is transformed into:

A C G T

S =ACGT

(a b c db a d cc d a bd c b a

) a + b + c + d = 1

a + b − c − d > 0

a− b + c − d > 0

a− b − c + d > 0

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 31 / 36

In Fourier parameters:

∆3+ =

In probability parameters this is transformed into:

A C G T

S =ACGT

(a b c db a d cc d a bd c b a

) a + b + c + d = 1

a + b > 1/2

a + c > 1/2

a + d > 1/2

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 31 / 36

Local complete intersection

Case n = 3

(Sturmfels–Sullivant ’05): I(V ) is minimally generated by 16cubics and 18 quartics.

LemmaThe following six quartics

qAAAqATT qTCGqTGC−qACCqAGGqTAT qTTA, qCCAqCTGqTAT qTGC−qCACqCGT qTCGqTTA,

qAGGqATT qCACqCCA−qAAAqACCqCGT qCTG, qACCqATT qGAGqGGA−qAAAqAGGqGCT qGTC ,

qCACqCTGqGCT qGGA−qCCAqCGT qGAGqGTC , qGGAqGTCqTAT qTCG−qGAGqGCT qTGCqTTA

generate a local complete intersection that defines W at each pointq ∈ W+.

It does not depend on the point q. Skip proof

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 32 / 36

Proof.1 J = (f1 . . . , f6) defines a variety X containing W and both have the

same dimension.2 The points q ∈ W+ are smooth points of X and W .3 Both varieties coincide locally.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 33 / 36

Local complete intersection

Arbitrary n

quartics:

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics:

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics: J3

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics: J3

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics: J3 ∪Jn−1

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics: J3 ∪Jn−1

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Local complete intersection

Arbitrary n

quartics: J3 ∪Jn−1

quadrics: Q, non-redundant 2× 2 minors from flattening

TheoremJ3 ∪ Jn−1 ∪Q generate a local complete intersection that defines W atthe biologically meaningful points q ∈ W+.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 34 / 36

Outline

1 Algebraic evolutionary models

2 Phylogenetic inference using algebraic geometry

3 The geometry of the Kimura variety

4 New results on simulated data

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 35 / 36

New results on simulated data

4 leaves

Instead of all generators, 48 polynomials that generate locally

Length 100 Length 500 Length 1000

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 36 / 36

New results on simulated data

4 leaves

Instead of all generators, 48 polynomials that generate locally

Length 100 Length 500 Length 1000

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 36 / 36

New results on simulated data

4 leaves

Instead of all generators, 48 polynomials that generate locally

Length 100 Length 500 Length 1000

(Eriksson–Yao’07) Machine learning approach, 52 invariants thatperform better on this parameter space.

M. Casanellas (UPC) Using algebraic geometry ... IMA, March 2007 36 / 36