June 2, 2015. Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo...
April 18, 2023 1
Combinatorial methods in Bioinformatics: the haplotyping problem
Paola Bonizzoni
DISCo, Università di Milano-Bicocca

Content
• Motivation: biological terms
• Combinatorial methods in haplotyping
• Haplotyping via perfect phylogeny: the PPH problem
• Inference of incomplete perfect phylogeny: algorithms
• Incomplete pph and missing data
• Other models: open problems
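
Biological terms
Diploid organism: a maternal and a paternal haplotype together form the genotype.
[Figure: maternal haplotype (A, A, A) and paternal haplotype (G, C, A) over sites i, i+1, i+2; each site of the genotype is homozygous or heterozygous.]
Biallelic site i: Value(i) ⊆ {A,C,G,T} and |Value(i)| ≤ 2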

Motivations
• Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
• The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
• The human genome project produces genotype sequences of humans.
• Computational methods to derive haplotypes from genotype data are in demand.
• Ongoing international HapMap project: find haplotype differences in large-scale population data.
Combinatorial methods:
• graphs
• set-cover problems
• optimization problems

Haplotyping: the formal model
Haplotype: m-vector h = <0, 1, …, 0> over {0,1}^m
Genotype: m-sequence g = <{0,1}, …, {0,0}, …, {1,1}>, written g = <*, …, 0, …, 1> over {0,1,*} (the pair {0,1} is encoded as *, {0,0} as 0, {1,1} as 1)
Def. Haplotypes <h, k> solve genotype g iff, for each site i:
• g(i) = * implies h(i) ≠ k(i)
• h(i) = k(i) = g(i) otherwise
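The definition can be checked site by site. The following is a minimal Python sketch, assuming haplotypes and genotypes are encoded as equal-length strings over `0`, `1` and `*`; the function name `solves` and the string encoding are illustrative choices, not part of the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g.

    h and k are strings over {0, 1}; g is a string over {0, 1, *}.
    (String encoding is an assumption made for this sketch.)
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must equal the genotype value.
            return False
    return True


# The example from the slides: g = <0,*,1,*,0,1,1>.
print(solves("0110011", "0011011", "0*1*011"))  # True
```

The relation is symmetric in h and k, so `solves(k, h, g)` gives the same answer.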

Examples
g = <0,*,1,*,0,1,1>
h = <0,1,1,0,0,1,1>
k = <0,0,1,1,0,1,1>
g is solved by <k,h>
Clark inference rule
Genotypes: g1 = <0,*,1,*,0,1,1>, g2 = <0,1,*,0,0,1,1>, g3 = <0,0,*,*,1,1,1>, g3 = <0,1,0,*,0,1,1>
Known haplotype: h1 = <0,0,1,1,0,1,1>
• From h1 and g1 infer h2 = <0,1,1,0,0,1,1>
• From h2 and g2 infer h3 = <0,1,0,0,0,1,1>
• Continue with g3 = <0,1,0,*,0,1,1> and the haplotypes h1, h2, h3
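Clark's rule can be read as: whenever a known haplotype h is compatible with an unresolved genotype g, infer the unique k such that <h, k> solves g, add k to the known set, and repeat. The sketch below is one possible greedy rendering under that reading, with haplotypes and genotypes encoded as strings over `0`, `1`, `*`; the names `complement` and `clark` are hypothetical, and the result generally depends on the order in which genotypes and haplotypes are tried.

```python
def complement(h, g):
    """Return the haplotype k such that <h, k> solves g, or None if h
    disagrees with g at some homozygous site."""
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            k.append("1" if hi == "0" else "0")  # heterozygous: flip h
        elif hi == gi:
            k.append(gi)                          # homozygous: copy
        else:
            return None                           # h is incompatible with g
    return "".join(k)


def clark(known, genotypes):
    """Greedy application of the inference rule (order-dependent sketch)."""
    known = list(known)
    unresolved = list(genotypes)
    resolved = {}
    progress = True
    while progress and unresolved:
        progress = False
        for g in list(unresolved):
            for h in known:
                k = complement(h, g)
                if k is not None:
                    resolved[g] = (h, k)
                    if k not in known:
                        known.append(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return known, resolved, unresolved


known, resolved, unresolved = clark(
    ["0011011"],
    ["0*1*011", "01*0011", "00**111", "010*011"],
)
print(known)       # ['0011011', '0110011', '0100011', '0101011']
print(unresolved)  # ['00**111']
```

On the genotypes of the slide, the rule reproduces h2 and h3 from h1, while the genotype <0,0,*,*,1,1,1> stays unresolved because no inferred haplotype is compatible with it.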

Haplotype inference: the general problem
Problem HI:
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'.
H' derives from an inference RULE

Types of inference rules
• Clark's rule: haplotypes solve g by an iterative rule
• Gusfield coalescent model: haplotypes are related to genotypes by a tree model
• Pedigree data: haplotypes are related to genotypes by a directed graph
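
Mendelian law and Recombination
[Figure: Mendelian law. Father with haplotypes A, B and Mother with haplotypes C, D; children C1, C2, C3, C4 with haplotype pairs AC, AD, BC, DB.]
[Figure: recombination. Parent: BD, AC. Child: AC, BD, AD, BC.]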

Pedigree
Pedigree, nuclear family, founder
[Figure: pedigree diagram with labels Father, Mother, Children, ID Num, Genotypes, Founders, Nuclear family, Family trio, loop, Mating node.]

Haplotyping from genotypes: the problem & methods
Problem:
• Input: genotype data (possibly with missing entries).
• Output: haplotypes.
Input data:
• Data with pedigree (dependent).
• Data without pedigree info (independent).
Statistical methods: find the most likely haplotypes based on genotype data.
• Adv: solid theoretical bases
• Disadv: computationally intensive
Rule-based methods: define rules based on some plausible assumptions and find those haplotypes consistent with these rules.
• Adv: usually simple, thus very fast
• Disadv: no numerical assessment of the reliability of the results

HI by the perfect phylogeny model
IDEA: genotypes are the mating of haplotypes in a tree.
G: g1 = <0,1,*,*,1>, g2 = <*,0,0,0,1>
H: <0,1,1,0,1> and <0,1,0,1,1> (solving g1); <1,0,0,0,1> and <0,0,0,0,1> (solving g2)
[Figure: tree T with root 00000 relating the haplotypes of H.]
Given G, find H and T that explain G!

Perfect Phylogeny models
Input data: 0-1 matrix A (rows: species, columns: characters)
Output data: phylogeny for A
     c1 c2 c3 c4 c5
s1    1  1  0  0  0
s2    0  0  1  0  0
s3    1  1  0  0  1
s4    0  0  1  1  0
[Figure: tree with root R, edges labelled c1, c2 (together), c3, c4, c5 and leaves s1, s2, s3, s4; the path to s4 is c3 c4.]

Perfect phylogeny
Def. A pp T for a 0-1 matrix A:
• each row si labels exactly one leaf of T
• each column cj labels exactly one edge of T
• each internal edge is labelled by at least one column cj
• row si gives the 0,1 path from the root to si
[Figure: the tree for the matrix of the previous slide; the path c3 c4 leads to s4, whose row is 0 0 1 1 0.]

pp model: another view
L(x), the cluster of x: the set of leaves of the subtree T_x rooted at x
[Figure: tree with leaves s4, s2, s1, s3 and an internal node x.]
A pp is associated with a tree-family (S,C), with S = {s1, …, sn} and C = {S' ⊆ S: S' is a cluster}, s.t. ∀ X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.
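The tree-family condition says that any two clusters that intersect must be nested. A minimal Python sketch of that check follows; the function name `is_tree_family` and the representation of clusters as Python sets of row labels are assumptions made for illustration. The first example uses the column sets read off the 4 × 5 matrix of the "Perfect Phylogeny models" slide.

```python
from itertools import combinations


def is_tree_family(clusters):
    """Return True if every two clusters are either disjoint or nested."""
    sets = [frozenset(c) for c in clusters]
    for x, y in combinations(sets, 2):
        # x & y non-empty means the clusters intersect; then one must
        # contain the other.
        if x & y and not (x <= y or y <= x):
            return False
    return True


# Clusters of columns c1..c5 (rows having a 1 in that column).
clusters = [
    {"s1", "s3"},  # c1
    {"s1", "s3"},  # c2
    {"s2", "s4"},  # c3
    {"s4"},        # c4
    {"s3"},        # c5
]
print(is_tree_family(clusters))                      # True
print(is_tree_family([{"s1", "s2"}, {"s2", "s3"}]))  # False
```

This pairwise test is quadratic in the number of clusters; it is meant to illustrate the definition, not to be an efficient algorithm.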

pp: another view
A tree-family (S,C) is represented by a 0-1 matrix B = (b_ji):
• each column ci corresponds to a set S' ∈ C: sj ∈ S' iff b_ji = 1
• for each set in C there is at least one column
0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0
Lemma
A 0-1 matrix has a pp iff it represents a tree-family

Haplotyping by the pp
A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: rows si are haplotypes, columns ci are SNPs (SNP sites).
The 0-1 switch in position i occurs only once in the tree!!
[Figure: tree over the haplotypes 00000, 01000, 11000, 01001, with an edge ci marking the SNP site where the switch occurs.]

Haplotyping and the pp: observations
• The root of T may not be the haplotype 000000
• 0-1 switch or 1-0 switch (directed case)
[Figure: example trees with 0-1 switches and 1-0 switches over the haplotypes 01100, 11000, 01000, 00011, 01010, 11010, 01001, 11001, 00000.]

HI problem in the pp model
Input data: a 0-1-* matrix B (n × m) of genotypes G
Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows <h,k> in B'
(2) B' has a pp (tree family)
DECISION problem
[Figure: an example 0-1-* genotype matrix whose haplotype matrix is to be found (???).]
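
An example
Genotypes:
a   * *
b   0 *
c   1 0
Haplotypes:
a   1 0
a'  0 1
b   0 1
b'  0 0
c   1 0
c'  1 0
[Figure: perfect phylogeny for the haplotype matrix, grouping a, c, c' (rows 1 0), a', b (rows 0 1) and b' (row 0 0).]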

The pph problem: solutions
• An undirected algorithm: Gusfield, Recomb 2002
• An O(nm²) algorithm: Karp et al., Recomb 2003
• A linear-time O(nm) algorithm?? (optimal algorithm)
A related problem: the incomplete directed pp (IDP)
• Inferring a pp from a 0-1-* matrix: O(nm + k log²(n+m)) algorithm, Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

IDP problem
Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists"
Example instance A:
     1 2 3 4 5
S1   1 ? 0 0 1
S2   ? ? 0 1 0
S3   ? 0 1 ? ?
Successive completions:
1 0 0 0 1 / ? ? 0 1 0 / ? 0 1 ? ?
1 0 0 0 1 / 1 ? 0 1 0 / ? 0 1 ? ?
1 0 0 0 1 / 1 1 0 1 0 / ? 0 1 ? ?
Solution A':
     1 2 3 4 5
S1   1 0 0 0 1
S2   1 1 0 1 0
S3   1 0 1 0 1
[Figure: pp for A' with edges C1, C2 C4, C5, C3 and leaves S2, S1, S3.]
OPEN PROBLEM: find an optimal algorithm??
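
Decision algorithms for incomplete pp
Based on: characterization of a 0-1 matrix A that has a pp
• tree family
• forbidden submatrix: gives a "no" certificate
Forbidden submatrix (two columns c, c' and three rows):
c  c'
1  0
1  1
0  1
[Figure: clusters X and Y with regions labelled 00, 01, 10, 11.]
Bipartite graph G(A) = (S,C,E) with E = {(si,cj): bij = 1}
Forbidden subgraph: the subgraph of G(A) induced by two columns c, c' and three rows s1, s2, s3 that form the forbidden submatrix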
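For the directed case, the "no" certificate is a pair of columns that contains the three rows 1 0, 1 1 and 0 1. A brute-force Python sketch of the search is given below; the function name `find_forbidden_submatrix` and the list-of-lists matrix encoding are assumptions, and the O(nm²) pairwise scan is chosen for clarity rather than taken from the slides.

```python
from itertools import combinations


def find_forbidden_submatrix(A):
    """Return (j, k, (r10, r11, r01)) for the first column pair j < k that
    contains rows 1 0, 1 1 and 0 1, or None if no such pair exists.

    A is a list of rows, each a list of 0/1 integers.
    """
    m = len(A[0]) if A else 0
    for j, k in combinations(range(m), 2):
        witness = {}
        for i, row in enumerate(A):
            pair = (row[j], row[k])
            if pair in ((1, 0), (1, 1), (0, 1)):
                witness.setdefault(pair, i)  # remember the first witness row
        if len(witness) == 3:
            return j, k, (witness[(1, 0)], witness[(1, 1)], witness[(0, 1)])
    return None


# Matrix of the "Perfect Phylogeny models" slide: no certificate.
A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(find_forbidden_submatrix(A))                         # None
print(find_forbidden_submatrix([[1, 0], [1, 1], [0, 1]]))  # (0, 1, (0, 1, 2))
```

The returned row and column indices are exactly the vertices of the forbidden subgraph in G(A).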

Test: does a 0-1 matrix A have a pp?
O(nm) algorithm (Gusfield 1991)
Steps:
1. Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A'
2. Let L(i,j) = k, where k = max{l < j: A'[i,l] = 1} (k = 0 if no such l exists)
3. Let index(j) = max{L(i,j): A'[i,j] = 1}
4. Then apply the theorem.
TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1
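The four steps translate almost line by line into code. The Python sketch below assumes the matrix is a list of 0/1 rows, reads each column top to bottom as a binary number (first row most significant), and takes L(i,j) = 0 when row i has no earlier 1; the function name `has_perfect_phylogeny` is illustrative. It uses Python's comparison sort for step 1, so it does not by itself achieve the O(nm) bound, which would need a radix sort of the columns.

```python
def has_perfect_phylogeny(A):
    """Test for a (directed) perfect phylogeny of the 0-1 matrix A,
    following the sort / L(i,j) / index(j) steps."""
    if not A or not A[0]:
        return True
    n, m = len(A), len(A[0])

    # Step 1: columns as tuples, sorted as decreasing binary numbers.
    cols = sorted((tuple(row[j] for row in A) for j in range(m)), reverse=True)

    # Steps 2-3: columns are numbered 1..m so that 0 can mean "no earlier 1".
    L = {}
    index = [0] * (m + 1)
    for i in range(n):
        last = 0  # largest column l < j with A'[i,l] = 1 seen so far
        for j, col in enumerate(cols, start=1):
            if col[i] == 1:
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j

    # Step 4: the theorem.
    return all(L[(i, j)] == index[j] for (i, j) in L)


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_perfect_phylogeny(A))                         # True
print(has_perfect_phylogeny([[1, 0], [1, 1], [0, 1]]))  # False
```

The second call fails the test because the two columns overlap in one row without being nested, which is exactly the forbidden pattern 1 0, 1 1, 0 1.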

Other HI problems via the pp model
Incomplete 0-1-*-? matrix because of missing data:
• haplotype pp (Ihpp): haplotype rows
• genotype pp (Igpp): genotype rows
Algorithms:
• Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
• Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, ICALP 2004); NP-complete otherwise

HI problem and other models
Haplotype inference in pedigree data under the recombination model
[Figure: maternal and paternal haplotype pairs (0-1 columns) and the child's haplotypes, one of which results from a recombination.]

Pedigree graph
[Figure: single-mating pedigree tree, mating loop, nuclear family (father, mother, child), pedigree graph.]

Haplotype inference in pedigree
[Figure: a pedigree whose members carry unphased genotypes (00, 01, 10, 11) and the corresponding phased haplotypes written paternal|maternal (0|0, 0|1, 1|0, 1|1).]

Problems:
• MPT-MRHI (pedigree tree multi-mating minimum recombination HI)
• SPT-MRHI (pedigree tree single-mating minimum recombination HI)
OPEN
NP-complete even if the graph is acyclic, but unbounded number of children…