A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem
Zhihong Ding, Vladimir Filkov, Dan Gusfield
RECOMB 2005, pp. 585–600
Date: Nov. 23, 2005
Introducer: Hsing-Yen Ann
Modified from: http://wwwcsif.cs.ucdavis.edu/~gusfield/LPPH_RECOMB05.ppt


Page 2: Abstract

Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002, the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly.

Page 3: Haplotypes to Genotypes

Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0

Each individual has two haplotypes. Merging them gives the genotype for the individual, which is what the experimental results report.

Merge rule at each site:
two 0s        -> 0
two 1s        -> 1
one 0 + one 1 -> 2
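
The merge rule on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `merge_haplotypes` is made up here, and the two input rows are the haplotypes shown on the slide.

```python
def merge_haplotypes(hap1, hap2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(hap1) != len(hap2):
        raise ValueError("haplotypes must cover the same sites")
    genotype = []
    for x, y in zip(hap1, hap2):
        if x == y:
            genotype.append(x)   # two 0s -> 0, two 1s -> 1
        else:
            genotype.append(2)   # one 0 and one 1 -> 2
    return genotype


# The two haplotypes from the slide (sites 1 to 9).
hap1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
hap2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(hap1, hap2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The result matches the genotype row of the slide; note that the merge loses the information about which haplotype contributed the 1 at each site of value 2.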

Page 4: Genotypes to Haplotypes

Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
Genotype:     2 1 2 1 0 0 1 2 0

Going back from the genotype to the two haplotypes, each site expands as:
0 -> (0, 0)
1 -> (1, 1)
2 -> (1, 0) or (0, 1)

A genotype with k sites of value 2 has 2^k possible solutions (2^(k-1) if the order of the two haplotypes is ignored)!

Haplotype Inference Problem: Given a set of n genotypes (on the same sites), determine the original set of n haplotype pairs that generated the n genotypes.
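
To make the blow-up concrete, the following sketch enumerates every haplotype pair that explains a single genotype. It is a brute-force illustration, not part of the paper's method; `explaining_pairs` is a hypothetical helper name, and the genotype is the one from the slide.

```python
from itertools import product


def explaining_pairs(genotype):
    """Return every unordered haplotype pair that merges to `genotype`."""
    het_sites = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het_sites)):
        hap1 = list(genotype)
        hap2 = list(genotype)
        for site, bit in zip(het_sites, bits):
            hap1[site] = bit       # one copy gets `bit` ...
            hap2[site] = 1 - bit   # ... the other copy gets the opposite
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((tuple(hap1), tuple(hap2)))))
    return sorted(pairs)


genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]   # three sites of value 2
pairs = explaining_pairs(genotype)
print(len(pairs))  # 4
```

With three sites of value 2 there are 2^3 = 8 ordered assignments, which collapse to 4 unordered pairs; the pair shown on the slide is one of them.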

Page 5: The Perfect Phylogeny Model of Haplotype Evolution

[Figure: a tree over sites 1 to 5. The root is the ancestral haplotype 00000. Each edge is labeled with the site that mutates on it (1, 2, 3, 4, 5). The extant haplotypes at the leaves are 10100, 10000, 01011, 00010 and 01010.]

Ancestral haplotype at the root
Extant haplotypes at the leaves
Site mutations on edges

Perfect: no site ever mutates twice.
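
One standard way to test whether a set of 0/1 haplotypes fits this model is sketched below. It assumes the ancestral haplotype is all zeros, as on the slide; under that assumption the haplotypes fit a perfect phylogeny exactly when, for every pair of sites, the sets of haplotypes carrying the mutation are disjoint or nested. The function name is made up for illustration.

```python
from itertools import combinations


def fits_perfect_phylogeny(haplotypes):
    """True if the 0/1 haplotypes fit a perfect phylogeny rooted at all zeros."""
    num_sites = len(haplotypes[0])
    # For each site, the set of haplotypes (by index) that carry the mutation.
    carriers = [
        {row for row, hap in enumerate(haplotypes) if hap[site] == 1}
        for site in range(num_sites)
    ]
    for a, b in combinations(carriers, 2):
        # Two sites that each mutate once are compatible only if their
        # carrier sets are disjoint or one contains the other.
        if a & b and not (a <= b or b <= a):
            return False
    return True


# Leaf haplotypes read from the slide.
leaves = ["10100", "10000", "01011", "00010", "01010"]
haplotypes = [[int(c) for c in leaf] for leaf in leaves]
print(fits_perfect_phylogeny(haplotypes))  # True
```

The pairwise check is quadratic in the number of sites, so it is a teaching aid rather than anything close to the linear-time method of the talk.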

Page 6: The Perfect Phylogeny Haplotyping (PPH) Problem

Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.

Genotype matrix
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix
Site  1 2
a     1 0
a     0 1
b     0 0
b     0 1
c     1 0
c     1 0

Perfect phylogeny
[Figure: root 00 with one edge for site 1 and one edge for site 2. Haplotype 10 is labeled (a, c, c), haplotype 01 is labeled (a, b), and the root haplotype 00 is labeled (b).]
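
For an instance this small, the PPH problem can be solved by exhaustive search, which makes the definition concrete. The sketch below is emphatically not the linear-time algorithm of the talk: it tries every phasing and keeps those that pass a pairwise compatibility check, assuming an all-zero root as in the figure. All function names are hypothetical, and the genotype rows are as read from the slide.

```python
from itertools import combinations, product


def is_perfect(haplotypes):
    """Check the all-zero-root perfect phylogeny condition on 0/1 rows."""
    sites = range(len(haplotypes[0]))
    carriers = [{r for r, h in enumerate(haplotypes) if h[s] == 1} for s in sites]
    return all(
        not (a & b) or a <= b or b <= a
        for a, b in combinations(carriers, 2)
    )


def phasings(genotype):
    """Yield each unordered haplotype pair that explains one genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        # Pin the first site of value 2 to 1 in hap1 to skip mirror images.
        choice = dict(zip(het, (1,) + bits))
        hap1 = tuple(choice.get(i, g) for i, g in enumerate(genotype))
        hap2 = tuple(1 - choice[i] if i in choice else g
                     for i, g in enumerate(genotype))
        yield hap1, hap2


def brute_force_pph(genotypes):
    """Return all PPH solutions by exhaustive search (exponential time)."""
    solutions = []
    for combo in product(*(list(phasings(g)) for g in genotypes)):
        haplotypes = [hap for pair in combo for hap in pair]
        if is_perfect(haplotypes):
            solutions.append(combo)
    return solutions


genotypes = [(2, 2), (0, 2), (1, 0)]   # rows a, b, c
for solution in brute_force_pph(genotypes):
    print(solution)
# (((1, 0), (0, 1)), ((0, 1), (0, 0)), ((1, 0), (1, 0)))
```

The single solution found is the haplotype matrix of the slide. The search space doubles with every additional site of value 2, which is the motivation for a faster algorithm.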

Page 7: The Perfection

An example that does not fit a perfect phylogeny

Genotype matrix
Site  1 2
a     2 2
b     0 2
c     1 0

Haplotype matrix (Not Perfect!!)
Site  1 2
a     1 1
a     0 0
b     0 0
b     0 1
c     1 0
c     1 0

[Figure: root 00, labeled (a, b), with one edge for site 1 leading to 10 (c, c) and one edge for site 2 leading to 01 (b). The haplotype 11 of a can only be attached below 10 by mutating site 2 again, or below 01 by mutating site 1 again.]

Sites 1 and 2 show all four combinations 00, 01, 10 and 11, so no tree in which each site mutates only once can produce these haplotypes.
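
The failure on this slide is an instance of the classical four-gamete condition: a pair of sites that shows all of 00, 01, 10 and 11 cannot fit a perfect phylogeny. A minimal sketch, with a made-up function name and the two haplotype matrices as read from this slide and the previous one:

```python
from itertools import combinations


def conflicting_site_pairs(haplotypes):
    """Return 1-based site pairs on which 00, 01, 10 and 11 all occur."""
    num_sites = len(haplotypes[0])
    conflicts = []
    for i, j in combinations(range(num_sites), 2):
        gametes = {(hap[i], hap[j]) for hap in haplotypes}
        if len(gametes) == 4:
            conflicts.append((i + 1, j + 1))
    return conflicts


# Rows a, a, b, b, c, c in both matrices.
perfect = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
not_perfect = [(1, 1), (0, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
print(conflicting_site_pairs(perfect))      # []
print(conflicting_site_pairs(not_perfect))  # [(1, 2)]
```

Both matrices explain the same genotypes; only the phasing of individual a differs, and that choice alone decides whether a perfect phylogeny exists.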

Page 8: Prior Work

Several existing algorithms (n genotypes, m sites):
  A complex, nearly linear-time algorithm (with a small bug) runs in O(n m α(n m)) time.
  Two simpler but slower algorithms run in O(n m^2) time.

Contribution of this paper:
  A linear-time (O(n m)) algorithm.
  It uses a simple data structure, the Shadow Tree, and some simple operations on it.

Pages 9-15: Shadow Tree (1/7 to 7/7)

[Figure sequence over seven slides: a shadow tree with a root and, for each of the columns 1 to 5, one tree edge and one shadow edge. The terms introduced on the figure are: tree edge, shadow edge, class, free link, flipping, fixed link, classes merge.]

Page 16: The Algorithm

Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree.

While processing an element in one row, there are at most 4+3 cases, and all the cases can be done in constant time.

Assumption: The genotype matrix only contains entries of value 0 and 2.

Page 17: OldEntryList

Genotype matrix
Column  1 2 3 4 5
Row 1   2 2 2 0 0
Row 2   2 0 0 2 2
Row 3   2 2 2 0 2
Row 4   2 0 0 2 0

OldEntryList: the column indices that have an entry of value 2 in this row and also have an entry of value 2 in some previous row.

OldEntryList for row 3: 1, 2, 3, 5
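
The definition of OldEntryList translates directly into code. The sketch below is a naive version written for clarity: it rescans all earlier rows, whereas a linear-time implementation would presumably keep a per-column "seen a 2" flag instead. The function name and the 1-based indexing are choices made here, and the matrix is the one read from this slide.

```python
def old_entry_list(matrix, row):
    """Columns (1-based) with a 2 in `row` and a 2 in some earlier row.

    `row` is 1-based, matching the slide's "row 3".
    """
    current = matrix[row - 1]
    earlier = matrix[:row - 1]
    return [
        col + 1
        for col, value in enumerate(current)
        if value == 2 and any(prev[col] == 2 for prev in earlier)
    ]


genotype_matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
print(old_entry_list(genotype_matrix, 3))  # [1, 2, 3, 5]
```

For the first row the list is always empty, since there are no previous rows; every 2 in that row is a new entry.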

Page 18: Shadow Tree After Processing the First Two Rows

Genotype matrix
Column  1 2 3 4 5
Row 1   2 2 2 0 0
Row 2   2 0 0 2 2
Row 3   2 2 2 0 2
Row 4   2 0 0 2 0

[Figure: the shadow tree built from rows 1 and 2, with tree and shadow edges for columns 1 to 5.]

OldEntryList for row 3: 1, 2, 3, 5

Page 19: Algorithm – FirstPath

[Figure: the shadow tree with tree and shadow edges for columns 1 to 5, with the first path marked.]

OldEntryList: 1, 2, 3, 5
CheckList: 3, 2

Edges 4 and 5 cannot be on the same path to the root in any PPH solution.

Page 20: Algorithm – SecondPath

[Figure: the shadow tree with tree and shadow edges for columns 1 to 5, with the second path marked.]

OldEntryList: 1, 2, 3, 5
CheckList: 3, 2

Page 21: Shadow Tree to PPH Solutions (1/2)

Genotype matrix
Sites  1 2 3 4 5
a      2 2 2 0 0
b      2 0 0 2 2
c      2 2 2 0 2
d      2 0 0 2 0

[Figure, left: the final shadow tree, with tree and shadow edges for columns 1 to 5.]
[Figure, right: one PPH solution, a tree on edges 1 to 5.]

Page 22: Shadow Tree to PPH Solutions (2/2)

[Figure, left: the final shadow tree, with tree and shadow edges for columns 1 to 5.]
[Figure, right: the second PPH solution, a tree on edges 1 to 5 whose leaves are labeled (a, d), (b, c), (b, d) and (a, c).]

Page 23: The End

Page 24: A P-Class of PPH Solutions

Genotype matrix
Sites  1 2 3 4 5
a      2 2 2 0 0
b      2 0 0 2 2
c      2 2 2 0 2
d      2 0 0 2 0

[Figure: one PPH solution, a tree from the root on edges 1 to 5 whose leaves are labeled (a, d), (a, c), (b, d) and (b, c).]

P-Class: Maximum common subgraph in all PPH solutions.

Each P-Class consists of two subtrees.

Page 25: P-Class Property of PPH Solutions

All PPH solutions can be obtained by choosing how to flip each P-Class.

[Figure, left: one PPH solution, a tree on edges 1 to 5 with leaves (a, d), (a, c), (b, c) and (b, d); the switching points are marked.]
[Figure, right: the second PPH solution, obtained by flipping at the switching points, with leaves (a, c), (b, d), (a, d) and (b, c).]

Page 26: The Key Theorem

Every PPH solution can be obtained by choosing a flip for each P-Class.

Conversely, after fixing one P-Class, every distinct choice of flips of the remaining P-Classes leads to a distinct PPH solution.

If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.
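
The count in the theorem follows from holding one P-Class fixed and letting each of the others be flipped or not. The toy enumeration below only illustrates that counting argument; it does not build trees, and `flip_choices` is a hypothetical name.

```python
from itertools import product


def flip_choices(num_p_classes):
    """Yield one flip vector per distinct solution; class 0 is held fixed."""
    if num_p_classes < 1:
        raise ValueError("need at least one P-Class")
    for rest in product((False, True), repeat=num_p_classes - 1):
        yield (False,) + rest


for k in (1, 2, 3):
    print(k, len(list(flip_choices(k))))
# 1 1
# 2 2
# 3 4
```

With k = 2 this gives two solutions, which agrees with the two PPH solutions shown for the 4-by-5 example in this deck.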

Page 27: Shadow Tree

Contains classes
  Each class in the shadow tree is a subgraph of a P-Class.
  Merging classes results in larger classes; classes are never split.
Contains tree edges and shadow edges

Page 28: Overview of the Algorithm for One Row

Procedure FirstPath
Procedure SecondPath
Procedure FixTree
Procedure NewEntries

Page 29: Procedures FirstPath and SecondPath

FirstPath: Construct a first path towards the root of the shadow tree which passes through tree edges of as many columns in OldEntryList as possible.

SecondPath: Construct a second path towards the root of the shadow tree which passes through tree edges of the columns in OldEntryList that are not on the first path.