June 2, 2015. Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo...


April 18, 2023 1

Combinatorial methods in Bioinformatics: the haplotyping problem

Paola Bonizzoni
DISCo, Università di Milano-Bicocca

April 18, 2023 2

Content

• Motivation: biological terms
• Combinatorial methods in haplotyping
• Haplotyping via perfect phylogeny: the PPH problem
• Inference of incomplete perfect phylogeny: algorithms
• Incomplete pph and missing data
• Other models: open problems

April 18, 2023 3

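Biological terms

Diploid organism

[Figure: a maternal haplotype A A A and a paternal haplotype G C A at sites i, i+1, i+2; the two haplotypes together form the genotype, with homozygous and heterozygous sites marked.]

Biallelic site i: Value(i) ⊆ {A,C,G,T} with |Value(i)| ≤ 2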

April 18, 2023 4

Motivations

• Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
• The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
• The human genome project produces genotype sequences of humans.
• Computational methods to derive haplotypes from genotype data are in demand.
• Ongoing international HapMap project: find haplotype differences on large-scale population data.

Combinatorial methods:

• graphs
• set-cover problems
• optimization problems

April 18, 2023 5

Haplotyping: the formal model

Haplotype: an m-vector h = <0, 1, …, 0> in {0,1}^m

Genotype: an m-sequence g = <{0,1}, …, {0,0}, …, {1,1}>, written over {0,1,*} by mapping {0,1} to *, {0,0} to 0 and {1,1} to 1, so that g = <*, …, 0, …, 1>

Def. Haplotypes <h, k> solve genotype g iff, for each site i:
  g(i) = * implies h(i) ≠ k(i)
  h(i) = k(i) = g(i) otherwise
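The definition above can be checked mechanically. The following is a minimal sketch, assuming haplotypes are lists of 0/1 and genotypes are lists over 0, 1 and the string "*"; the function name `solves` is illustrative and does not come from the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g."""
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must carry the genotype value.
            return False
    return True


# The genotype and haplotypes used in the Examples slide.
g = [0, "*", 1, "*", 0, 1, 1]
h = [0, 1, 1, 0, 0, 1, 1]
k = [0, 0, 1, 1, 0, 1, 1]
print(solves(h, k, g))
```

The pair is unordered, so `solves(k, h, g)` gives the same answer as `solves(h, k, g)`.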

April 18, 2023 6

Examples

g = <0,*,1,*,0,1,1>
h = <0,1,1,0,0,1,1>
k = <0,0,1,1,0,1,1>
g is solved by <k,h>

Clark inference rule

Genotypes: g1 = <0,*,1,*,0,1,1>, g2 = <0,1,*,0,0,1,1>, g3 = <0,0,*,*,1,1,1> (shown later in the slide as g3 = <0,1,0,*,0,1,1>)
Known haplotype: h1 = <0,0,1,1,0,1,1>

Step 1: h1 and g1 give h2 = <0,1,1,0,0,1,1>
Step 2: h2 and g2 give h3 = <0,1,0,0,0,1,1>
Next genotype considered: g3 = <0,1,0,*,0,1,1>

[Figure: a genotype g is solved by a known haplotype h and an inferred haplotype k; g1 is solved by h1 and h2.]
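One step of Clark's rule can be sketched as follows: given a haplotype that is already known and a genotype compatible with it, the mate haplotype is forced. This is an illustrative sketch only; the name `clark_infer` and the list encoding (0, 1 and "*") are assumptions, and the full iterative procedure (choosing which genotype to resolve next) is not shown.

```python
def clark_infer(known, g):
    """Return the haplotype that, paired with `known`, solves g, or None if incompatible."""
    mate = []
    for hi, gi in zip(known, g):
        if gi == "*":
            mate.append(1 - hi)   # heterozygous site: mate carries the other allele
        elif hi == gi:
            mate.append(gi)       # homozygous site: mate carries the same allele
        else:
            return None           # known haplotype contradicts the genotype
    return mate


h1 = [0, 0, 1, 1, 0, 1, 1]
g1 = [0, "*", 1, "*", 0, 1, 1]
g2 = [0, 1, "*", 0, 0, 1, 1]

h2 = clark_infer(h1, g1)
h3 = clark_infer(h2, g2)
print(h2, h3)
```

With the slide's data, the first call reproduces h2 and the second reproduces h3.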

April 18, 2023 7

Haplotype inference: the general problem

Problem HI
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
Solution: a set H’ of haplotypes that solves each genotype g in G s.t. H ⊆ H’.

H’ derives from an inference RULE

April 18, 2023 8

Types of inference rules

• Clark’s rule: haplotypes solve g by an iterative rule
• Gusfield coalescent model: haplotypes are related to genotypes by a tree model
• Pedigree data: haplotypes are related to genotypes by a directed graph

April 18, 2023 9

Mendelian law and Recombination

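[Figure: Mendelian law. A father with alleles A, B and a mother with alleles C, D; the children C1, C2, C3, C4 carry A C, A D, B C and D B.]

[Figure: recombination. A parent with haplotypes AC and BD; the child can receive AC, BD, AD or BC.]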
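To make the recombination picture concrete, the sketch below counts the minimum number of switches between a parent's two haplotypes needed to spell the haplotype passed to a child. It is an illustration under stated assumptions: haplotypes are equal-length sequences of allele symbols, and the function name `min_recombinations` is hypothetical, not from the slides.

```python
def min_recombinations(hap_a, hap_b, transmitted):
    """Minimum number of switches between hap_a and hap_b that yields `transmitted`.

    Returns None if some site of `transmitted` matches neither parental haplotype.
    """
    inf = float("inf")
    cost_a = cost_b = 0  # cheapest way to be copying from hap_a / hap_b so far
    for a, b, t in zip(hap_a, hap_b, transmitted):
        new_a = min(cost_a, cost_b + 1) if a == t else inf
        new_b = min(cost_b, cost_a + 1) if b == t else inf
        cost_a, cost_b = new_a, new_b
    best = min(cost_a, cost_b)
    return None if best == inf else best


# Parent with haplotypes AC and BD, as in the recombination figure.
for child in ("AC", "BD", "AD", "BC"):
    print(child, min_recombinations("AC", "BD", child))
```

Under this encoding AC and BD need no switch, while AD and BC each need one.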

April 18, 2023 10

Pedigree

Pedigree, nuclear family, founder

April 18, 2023 11

Pedigree

Pedigree, nuclear family, founder

[Figure: pedigree diagram with father, mother and children, labelled with ID numbers and genotypes; it marks the founders, a nuclear family, a family trio, a loop and a mating node.]

April 18, 2023 12

Haplotyping from genotypes: the problem & methods

Problem:
• Input: genotype data (possibly with missing data).
• Output: haplotypes.

Input data:
• Data with pedigree (dependent).
• Data without pedigree info (independent).

Statistical methods: find the most likely haplotypes based on genotype data.
• Adv: solid theoretical bases
• Disadv: computation intensive

Rule-based methods: define rules based on some plausible assumptions and find those haplotypes consistent with these rules.
• Adv: usually simple, thus very fast
• Disadv: no numerical assessment of the reliability of the results

April 18, 2023 13

HI by the perfect phylogeny model

IDEA: genotypes are the mating of haplotypes in a tree

G (genotypes)        H (haplotypes)
g1 = 0,1,*,*,1       0,1,1,0,1 and 0,1,0,1,1
g2 = *,0,0,0,1       1,0,0,0,1 and 0,0,0,0,1

Given G, find H and T that explain G!

[Figure: tree T with root 00000 containing the haplotypes of H.]

April 18, 2023 14

Perfect Phylogeny models

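Input data: a 0-1 matrix A (rows: species, columns: characters)
Output data: a phylogeny for A

      c1  c2  c3  c4  c5
s1     1   1   0   0   0
s2     0   0   1   0   0
s3     1   1   0   0   1
s4     0   0   1   1   0

[Figure: phylogeny rooted at R with edges labelled c1, c2; c5; c3; c4 and leaves s1, s2, s3, s4. The path c3 c4 leads to s4.]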

April 18, 2023 15

Perfect phylogeny

Def. A pp T for a 0-1 matrix A:

• each row si labels exactly one leaf of T
• each column cj labels exactly one edge of T
• each internal edge is labelled by at least one column cj
• row si gives the 0,1 path from the root to si

[Figure: the tree with edges c1, c2; c5; c3; c4 and leaves s1, s2, s3, s4. The path c3 c4 to s4 is marked with the entries 0 0 1 1.]

April 18, 2023 16

pp model: another view

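L(x), the cluster of x: the set of leaves of T_x, the subtree rooted at x

[Figure: tree with leaves s1, s2, s3, s4 and an internal node x.]

A pp is associated to a tree-family (S,C) with S = {s1, …, sn} and C = {S’ ⊆ S : S’ is a cluster} s.t. for X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.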

April 18, 2023 17

pp : another view

A tree-family (S,C) is represented by a 0-1 matrix B = (b_ji):

0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0

• each column ci corresponds to a set S’ in C: sj ∈ S’ iff b_ji = 1
• for each set in C there is at least one column

Lemma
A 0-1 matrix is a pp iff it represents a tree-family
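The tree-family condition (any two clusters are either disjoint or nested) can be tested directly on the columns of a 0-1 matrix. The sketch below is a plain quadratic check written for illustration; the helper names `column_sets` and `is_tree_family` are assumptions, and rows are indexed from 0.

```python
from itertools import combinations


def column_sets(matrix):
    """Turn each column into the set of row indices that hold a 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [
        frozenset(i for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_cols)
    ]


def is_tree_family(sets):
    """True if every two sets are disjoint or one contains the other."""
    for x, y in combinations(sets, 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


# The matrix shown on this slide.
B = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(is_tree_family(column_sets(B)))
```

For this matrix the check succeeds; two columns that overlap without being nested, such as the columns of rows (1,0), (1,1), (0,1), make it fail.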

April 18, 2023 18

Haplotyping by the pp

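A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes:

• rows si: haplotypes
• columns ci: SNP sites

A 0-1 switch in position i occurs only once in the tree!

[Figure: tree whose nodes are labelled by haplotypes such as 00000, 01000, 11000 and 01001, with a row si and a column ci (SNP site) highlighted.]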

April 18, 2023 19

Haplotyping and the pp: observations

• The root of T may not be the haplotype 000000
• 0-1 switch or 1-0 switch (directed case)

[Figure: example trees illustrating 0-1 switches and 1-0 switches, with nodes labelled by haplotypes such as 01100, 11000, 01000, 00011, 01010, 11010, 01001, 11001 and 00000.]

April 18, 2023 20

HI problem in the pp model

Input data: a 0-1-* matrix B (n × m) of genotypes G
Output data: a 0-1 matrix B’ (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows <h,k> in B’
(2) B’ has a pp (tree family)

DECISION Problem

[Figure: an example 0-1-* genotype matrix to be resolved into haplotype rows such as 0,1,0,1,1.]

April 18, 2023 21

An example

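Genotypes:

a   * *
b   0 *
c   1 0

Haplotypes:

a   1 0
a’  0 1
b   0 1
b’  0 0
c   1 0
c’  1 0

[Figure: tree for the haplotypes, with a, c, c’ grouped together, a’ and b grouped together, and b’ on its own.]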
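For very small instances the decision problem can be explored by exhaustive search. The sketch below enumerates every way of resolving the genotypes and keeps the first haplotype set that passes the four-gamete condition (no pair of columns showing all of 00, 01, 10, 11), a standard characterization of a perfect phylogeny with unspecified root; it is not one of the algorithms cited in these slides, it runs in exponential time, and all names are illustrative.

```python
from itertools import combinations, product


def four_gamete_ok(haplotypes):
    """True if no two columns contain all four combinations 00, 01, 10, 11."""
    if not haplotypes:
        return True
    m = len(haplotypes[0])
    for i, j in combinations(range(m), 2):
        if len({(h[i], h[j]) for h in haplotypes}) == 4:
            return False
    return True


def resolutions(g):
    """Yield every haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, v in enumerate(g) if v == "*"]
    for bits in product((0, 1), repeat=len(stars)):
        h, k = list(g), list(g)
        for pos, b in zip(stars, bits):
            h[pos], k[pos] = b, 1 - b
        yield tuple(h), tuple(k)


def brute_force_pph(genotypes):
    """Return 2n haplotypes admitting a perfect phylogeny, or None."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if four_gamete_ok(haplotypes):
            return haplotypes
    return None


# Genotypes a, b, c from the example slide.
G = [("*", "*"), (0, "*"), (1, 0)]
print(brute_force_pph(G))
```

On the example, resolving a as 00/11 is rejected because all four gametes would appear, so the search settles on a resolution of a into 01 and 10, in line with the haplotypes listed on the slide.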

April 18, 2023 22

The pph problem: solutions

• An undirected algorithm: Gusfield, RECOMB 2002
• An O(nm^2) algorithm: Karp et al., RECOMB 2003
• A linear-time O(nm) algorithm ?? (optimal algorithm)

A related problem: the incomplete directed pp (IDP)

• Inferring a pp from a 0-1-* matrix: O(nm + k log^2(n+m)) algorithm, Pe’er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

April 18, 2023 23

IDP problem

Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A’ and a pp for A’, or say “no pp exists”

OPEN PROBLEM: find an optimal algorithm ??

Example (rows S1, S2, S3; columns 1 to 5):

1 ? 0 0 1
? ? 0 1 0
? 0 1 ? ?

The ? entries are resolved step by step, ending with:

1 0 0 0 1
1 1 0 1 0
1 0 1 0 1

[Figure: the resulting tree with edges labelled C1, C2, C3, C4, C5 and leaves S1, S2, S3.]
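As a baseline for tiny inputs, the IDP instance can be solved by trying every assignment of the ? entries and testing each completed matrix against the forbidden-submatrix condition for the directed case (no two columns showing rows 1 0, 0 1 and 1 1 together). This exhaustive sketch is exponential in the number of ? entries and is not the algorithm cited in the slides; the function names and the use of the string "?" for missing entries are assumptions.

```python
from itertools import combinations, product


def has_directed_pp(matrix):
    """True if no pair of columns contains the rows (1,0), (0,1) and (1,1)."""
    m = len(matrix[0]) if matrix else 0
    forbidden = {(1, 0), (0, 1), (1, 1)}
    for i, j in combinations(range(m), 2):
        if forbidden <= {(row[i], row[j]) for row in matrix}:
            return False
    return True


def brute_force_idp(matrix):
    """Return a completion of the '?' entries that has a directed pp, or None."""
    holes = [
        (r, c)
        for r, row in enumerate(matrix)
        for c, v in enumerate(row)
        if v == "?"
    ]
    for bits in product((0, 1), repeat=len(holes)):
        filled = [list(row) for row in matrix]
        for (r, c), b in zip(holes, bits):
            filled[r][c] = b
        if has_directed_pp(filled):
            return filled
    return None


# The instance from this slide.
A = [
    [1, "?", 0, 0, 1],
    ["?", "?", 0, 1, 0],
    ["?", 0, 1, "?", "?"],
]
print(brute_force_idp(A))
```

The search returns the first valid completion it meets, which need not be the one drawn on the slide; the slide's completed matrix also passes `has_directed_pp`.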

April 18, 2023 24

Decision algorithms for incomplete pp

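Based on: characterization of a 0-1 matrix A that has a pp

• Tree family
• Forbidden submatrix: gives a “no” certificate

[Figure: the forbidden submatrix with rows 1 0, 1 1, 0 1, next to two overlapping sets X and Y whose regions are labelled 00, 01, 10, 11.]

Bipartite graph G(A) = (S,C,E) with E = {(si,cj) : bij = 1}

[Figure: the forbidden subgraph on columns c, C’ and rows s1, s2, s3, whose rows read 10, 11, 01.]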

April 18, 2023 25

Test: a 0-1 matrix A has a pp?

O(nm) algorithm (Gusfield 1991)

Steps:
1. Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A’.
2. For each (i,j) with A’[i,j]=1, let L(i,j) = max{l < j : A’[i,l]=1}, or 0 if no such l exists.
3. Let index(j) = max{L(i,j) : A’[i,j]=1}.
4. Then apply the theorem.

TH. A’ has a pp iff L(i,j) = index(j) for each (i,j) s.t. A’[i,j]=1
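The test can be sketched in a few lines. This version follows the steps listed above but, as a simplifying assumption, sorts the columns with Python's built-in sort instead of a radix sort, so it does not achieve the O(nm) bound; L and index are computed only over cells that hold a 1, columns are numbered from 1, and 0 stands for "no earlier column". The function name is illustrative.

```python
def has_perfect_phylogeny(matrix):
    """Test whether a 0-1 matrix has a perfect phylogeny via the L / index criterion."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0

    # Step 1: read each column top to bottom as a binary number, sort decreasingly.
    columns = sorted(
        (tuple(row[j] for row in matrix) for j in range(m)), reverse=True
    )

    # Steps 2 and 3: L[(i, j)] is the closest earlier column with a 1 in row i;
    # index[j] is the maximum of L[(i, j)] over the rows with a 1 in column j.
    L = {}
    index = [0] * (m + 1)
    for i in range(n):
        last = 0
        for j in range(1, m + 1):
            if columns[j - 1][i] == 1:
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j

    # Step 4: the theorem.
    return all(value == index[j] for (_, j), value in L.items())


# The 4 x 5 species/character matrix from the "Perfect Phylogeny models" slide.
A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_perfect_phylogeny(A))
```

The matrix from the earlier slide passes the test, while the forbidden submatrix with rows (1,0), (1,1), (0,1) fails it.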

April 18, 2023 26

Idea:

April 18, 2023 27

The IDP algorithm

[Figure: bipartite graph on columns c, C’ and rows s1, s2, s3.]

April 18, 2023 28

Other HI problems via the pp model

Incomplete 0-1-*-? matrix because of missing data:
• haplotypes pp (Ihpp): haplotype rows
• genotype pp (Igpp): genotype rows

Algorithms:

• Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
• Igpp has a polynomial solution under the rich data hypothesis (Karp et al., RECOMB 2004 and ICALP 2004); NP-complete otherwise

April 18, 2023 29

HI problem and other models

Haplotype inference in pedigree data under the recombination model

[Figure: maternal and paternal haplotypes drawn as columns of 0/1 values, and a child haplotype obtained from them by recombination.]

April 18, 2023 30

Pedigree graph

[Figure: three pedigree diagrams built from father, mother and child nodes: a single-mating pedigree tree, a nuclear family, and a pedigree graph containing a mating loop.]

April 18, 2023 31

Haplotype inference in pedigree

[Figure: a pedigree whose members carry two-site genotypes such as 00, 01, 10, 11, shown next to the same pedigree with phased haplotypes written paternal|maternal, such as 0|0, 0|1, 1|0, 1|1.]

April 18, 2023 32

Problems:

• MPT-MRHI (Pedigree tree multi-mating minimum recombination HI)
• SPT-MRHI (Pedigree tree single-mating minimum recombination HI)

OPEN

NP-complete even if the graph is acyclic, but unbounded number of children…

April 18, 2023 33

Conclusions

April 18, 2023 34

References