recomb05 poster ent phasing

Post on 22-Feb-2016


Haplotype Inference by Entropy Minimization. Ion Mandoiu and Bogdan Pasaniuc, CSE Department, University of Connecticut.


Poster Session A, Bay 30

Haplotype Inference by Entropy Minimization
Ion Mandoiu and Bogdan Pasaniuc, CSE Department, University of Connecticut

• A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals, and mapping SNPs in the human population has become the next high priority in genomics after the completion of the Human Genome Project.

• In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype, which can be viewed as a 0/1 vector, e.g., by representing the most frequent (dominant) SNP allele as a 0 and the alternate (minor) allele as a 1.

Introduction

Example: the genotype gcc{AT}ac{TG} can be explained by either of two haplotype pairs: (gccTacG, gccAacT) or (gccTacT, gccAacG).

• At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but the conflated SNP information, the so-called genotype, is comparatively easy to obtain. A genotype can be conveniently represented as a 0/1/2 vector, where 0 (respectively 1) means that both chromosomes contain the dominant (respectively minor) allele, and 2 means that the two chromosomes contain different alleles.
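The 0/1/2 encoding can be illustrated with a minimal Python sketch. The function name `genotype_of` and the tuple representation are assumptions made for illustration; they do not come from the poster.

```python
def genotype_of(h1, h2):
    """Conflate two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Sites where both chromosomes agree keep the shared allele (0 or 1);
    # sites where they differ are recorded as 2.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# Two different haplotype pairs conflate to the same genotype,
# which is why the genotype alone does not determine the haplotypes.
print(genotype_of((0, 1, 0), (1, 0, 0)))  # -> (2, 2, 0)
print(genotype_of((0, 0, 0), (1, 1, 0)))  # -> (2, 2, 0)
```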


Problem Definition

A pair of haplotypes (h, h') explains a genotype g if h(i) = h'(i) = g(i) whenever g(i) is 0 or 1, and h(i) ≠ h'(i) whenever g(i) = 2.

A phasing of a set of genotypes $G \subseteq \{0,1,2\}^k$ is a function $f : G \to \{0,1\}^k \times \{0,1\}^k$ such that, for every g in G, f(g) is a pair of haplotypes that explains g.

Entropy of a phasing:

$$\mathrm{Entropy}(f) = -\sum_{h \,:\, \mathrm{cov}(h,f) \neq 0} \frac{\mathrm{cov}(h,f)}{2|G|} \, \log\!\left(\frac{\mathrm{cov}(h,f)}{2|G|}\right)$$

where cov(h, f) is the number of genotypes g from G such that f(g) = (h, h') or f(g) = (h', h), plus twice the number of genotypes g such that f(g) = (h, h).

Minimum Entropy Population Phasing: Given a set of genotypes, find a phasing with minimum entropy.
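The definitions above can be sketched in Python as follows. This is only an illustration: representing a phasing as a dict from genotype tuples to haplotype pairs, the function names, and the use of the natural logarithm (the poster does not state the base) are all assumptions.

```python
import math
from collections import Counter


def explains(h1, h2, g):
    """True if the haplotype pair (h1, h2) explains genotype g."""
    return all(
        (a == b == x) if x in (0, 1) else (a != b)
        for a, b, x in zip(h1, h2, g)
    )


def coverage(phasing):
    """cov(h, f): each genotype adds 1 to both of its haplotypes,
    so a homozygous explanation (h, h) adds 2 to h."""
    cov = Counter()
    for h1, h2 in phasing.values():
        cov[h1] += 1
        cov[h2] += 1
    return cov


def entropy(phasing):
    """Entropy of a phasing; natural log is an assumption."""
    total = 2 * len(phasing)  # 2|G| haplotype slots in total
    return -sum(
        (c / total) * math.log(c / total)
        for c in coverage(phasing).values()
    )


# Hypothetical two-genotype example over k = 2 SNPs.
f = {
    (2, 2): ((0, 1), (1, 0)),
    (0, 1): ((0, 1), (0, 1)),
}
assert all(explains(h1, h2, g) for g, (h1, h2) in f.items())
print(entropy(f))
```

In this example haplotype (0, 1) has coverage 3 and (1, 0) has coverage 1, so the entropy is that of the distribution (3/4, 1/4).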

Approaches to Phasing

• Maximum Likelihood
    • PHASE [Stephens et al. 01] - repeatedly chooses a genotype at random, and estimates that individual’s haplotypes under the assumption that all other haplotypes are correctly reconstructed
    • GERBIL [Kimmel&Shamir 05] - expectation maximization for genotype resolution and block partitioning
• Perfect Phylogeny
    • Set of haplotypes used in the phasing must be consistent with a perfect phylogeny [Gusfield 02]
• Pure Parsimony
    • Minimizing the number of distinct haplotypes
    • Integer Linear Program formulations: exponential size [Gusfield 04], polynomial size [Brown&Harrower 05]
• Entropy
    • Minimize the entropy of the phasing
    • [Halperin&Karp 04] - simple greedy approximation algorithm

Previous Approaches

Local optimization algorithm for entropy minimization

1. Create a random phasing f
2. Repeat forever:
    Find the pair (g, (h,h’)) that minimizes entropy(f’), where f’ is obtained from f by re-explaining g with (h,h’)
    If entropy(f’) < entropy(f), update f (change the current explanation for g to (h,h’))
    Else exit loop
3. Output the phasing f
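The three steps above can be sketched in Python as below. This is a brute-force illustration under stated assumptions, not the poster's implementation: it enumerates every explaining pair of every genotype (exponential in the number of heterozygous sites, so suitable only for short genotypes), recomputes the entropy from scratch for each candidate move, and uses the natural logarithm. All function names and the `seed` parameter are hypothetical.

```python
import itertools
import math
import random
from collections import Counter


def explaining_pairs(g):
    """All unordered haplotype pairs that explain the 0/1/2 genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    pairs = []
    # Fix the first heterozygous site to 0 in h1 so that each
    # unordered pair is generated exactly once.
    for bits in itertools.product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def entropy(phasing):
    cov = Counter(h for pair in phasing.values() for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * math.log(c / total) for c in cov.values())


def local_search(genotypes, seed=0):
    rng = random.Random(seed)
    candidates = {g: explaining_pairs(g) for g in set(genotypes)}
    # Step 1: random initial phasing.
    f = {g: rng.choice(c) for g, c in candidates.items()}
    while True:
        # Step 2: find the single re-explanation with the lowest entropy.
        best, best_move = entropy(f), None
        for g, cands in candidates.items():
            for pair in cands:
                if pair == f[g]:
                    continue
                trial = dict(f)
                trial[g] = pair
                e = entropy(trial)
                if e < best - 1e-12:  # require a strict improvement
                    best, best_move = e, (g, pair)
        if best_move is None:
            return f  # Step 3: no improving move, output the phasing
        f[best_move[0]] = best_move[1]


result = local_search([(0, 0, 1), (1, 0, 0), (2, 0, 2)])
print(result[(2, 0, 2)])  # -> ((0, 0, 1), (1, 0, 0))
```

In the usage example the ambiguous genotype ends up explained by the two haplotypes already present in the homozygous individuals, since that choice reuses existing haplotypes and gives the lowest entropy. The loop terminates because every accepted move strictly lowers the entropy and there are finitely many phasings.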

Phasing Short Genotypes

Switching Error (%)

                                  ENTROPY_PHASE
Dataset             #Gen  RAND    k=1    k=3    k=5    k=7    k=9    GERBIL  PHASE
Daly                129   43.5    15.9   5.1    3.7    4.2    4.6    2.7     2.6
Gabriel             29    35.4    25.1   17.3   17.9   17.6   18.4   11.1    9.8

5q31-wafr (89 snp)  50    47.5    25.8   15.7   16.1   15.6   17.2   12      2.8
                    100   48.2    24.9   12.9   12     11     9.1    9.9     0.6
                    200   48.1    24.7   12.1   9.2    6.6    5.3    11.5    0.1
                    400   48.5    24.7   11.7   8.1    5.2    3.6    11.5    0
                    600   48.5    24.6   11.2   7.2    4.3    2.7    11.4    0
                    800   48.5    24.8   10.8   7.5    4.0    3.0    11.3    --

5q31-euro (99 snp)  50    47.7    16.3   6.8    5.9    6      6.5    4.3     1.7
                    100   47.8    15.8   6.5    4.3    4.3    4.7    4.4     0.6
                    200   48.2    15.8   6.1    4      3.1    3.1    4.5     0.2
                    400   48.9    15.8   5.6    3.3    2.6    2.2    4.1     0
                    600   49      15.9   5.6    3.2    2.3    2      4.1     0
                    800   48.8    16.1   5.5    3.0    2.3    1.7    4.1     --

Datasets
• [Daly 2001] 129 family trios over a region of 103 SNPs
• [Gabriel 2002] 60 blocks with an average of 50 SNPs genotyped for 29 individuals
• [Forton et al 2004] Simulated populations generated as follows
    - 32 European and 32 West African family trios were genotyped at the IL8 and 5q31 regions [Hull et al. 2000]
    - Population haplotypes and their frequencies were inferred using Phamily and PHASE
    - Based on these haplotype frequencies, 100,000 random genotypes are generated, from which we selected populations of size between 50 and 800

Experimental Setup

[Figure: 5q31-wafr switch error. x-axis: #gen (50, 100, 200, 400, 600, 800); y-axis: error rate (0 to 30); series: win=1, win=3, win=5, win=7, win=9, GERBIL, PHASE]

Switch error rate
Given the true haplotypes (t, t’) and the inferred ones (h, h’), the switch error rate is the number of times we have to switch from reading h to h’ to obtain t, divided by the number of ambiguous positions.
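A minimal Python sketch of this measure follows. It assumes that the true and inferred pairs explain the same genotype, so the ambiguous positions are exactly the sites where h and h’ differ, and it divides by the number of ambiguous positions as stated above. The function name is hypothetical.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Switches needed to read t out of (h, h'), per ambiguous position."""
    t, _ = true_pair
    h, h_prime = inferred_pair
    # Ambiguous positions: sites where the two inferred haplotypes differ.
    ambiguous = [i for i in range(len(t)) if h[i] != h_prime[i]]
    if not ambiguous:
        return 0.0
    switches = 0
    current = None  # which inferred haplotype we are currently reading
    for i in ambiguous:
        source = 0 if h[i] == t[i] else 1  # 0 means h, 1 means h'
        if current is not None and source != current:
            switches += 1
        current = source
    return switches / len(ambiguous)


truth = ((0, 0, 0, 0), (1, 1, 1, 1))
inferred = ((0, 0, 1, 1), (1, 1, 0, 0))
print(switch_error_rate(truth, inferred))  # -> 0.25
```

Here t is read from h for the first two sites and from h’ for the last two, which is one switch over four ambiguous positions.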

IL8 datasets: Switching Error (%)

                                  ENTROPY_PHASE
Dataset             #Gen  RAND    k=1    k=3    k=5    k=7    k=9    GERBIL  PHASE
IL8-euro (55 snp)   50    44.9    4.9    3.7    3.1    2.8    3      2.9     1.7
                    100   46.2    5      3.8    3.2    2.7    3      2.8     1.2
                    200   47      4.7    3.1    2.8    2.2    2.3    2.6     0.5
                    400   47.6    4.8    3.2    2.7    2.3    2.2    2.7     0.4
                    600   47.6    4.7    3.1    2.7    2.1    2      2.6     0.4
                    800   47.8    4.8    3      2.7    2.1    2.1    2.6     --

IL8-wafr (52 snp)   50    45.1    17.2   12.5   11.6   11.6   10.9   9.6     4.9
                    100   45.9    16.5   12.2   11     9.7    10.1   9.7     2.5
                    200   46.5    16.6   11.4   9.5    8.2    7.6    8.8     0.7
                    400   46.9    16.5   10.9   9      7.5    6.4    8.7     0.2
                    600   47      16.2   10.9   8.8    6.9    5.6    8.5     0.1
                    800   47.1    16.3   10.7   8.3    6.6    5.2    8.4     --

• Entropy minimization gives a unified framework for various phasing problem variants, including phasing genotypes with missing data and pedigree constrained phasing

• Preliminary results show that entropy minimization is competitive with existing methods in haplotype reconstruction accuracy, particularly for large populations

• Currently, we are implementing trio-based entropy phasing and are exploring other strategies for phasing long genotypes

References
• V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: a direct approach. Technical Report, UC Davis, 2002
• D. Brown and I. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. Proceedings of WABI 2004, 254-265
• Daly et al. High resolution haplotype structure in the human genome. Nature Genetics, 29:229-232, 2001
• Gabriel et al. The structure of haplotype blocks in the human genome. Science, 296:2225-2229, 2002
• E. Halperin and R. Karp. The Minimum-Entropy Set Cover Problem. International Colloquium on Automata, Languages and Programming, 2004
• J. Hull et al. Association of respiratory syncytial virus bronchiolitis with the interleukin 8 gene region in UK families. Thorax 55:1023-1027, 2000
• M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68:978-989, 2001

Conclusions

Long Genotypes
• Divide the genotypes into windows of size k
• Run the previous algorithm for windows of size 2*k by fixing the first k SNPs

Handling Missing Data
• Any value is correct for a SNP with missing data; the result is more pairs of haplotypes that can explain a genotype
• The local improvement algorithm remains the same
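The effect of missing data on the set of explaining pairs can be illustrated with a small Python sketch. The `MISSING` marker, the function names, and the brute-force enumeration are assumptions made for illustration only.

```python
import itertools

MISSING = None  # hypothetical marker for a SNP with missing data


def site_options(x):
    """Allele pairs (a, b) allowed at one site of a genotype."""
    if x == 0:
        return [(0, 0)]
    if x == 1:
        return [(1, 1)]
    if x == 2:
        return [(0, 1), (1, 0)]
    # Missing data: any combination of alleles is acceptable.
    return [(0, 0), (0, 1), (1, 0), (1, 1)]


def explaining_pairs(g):
    """All unordered haplotype pairs that explain g, allowing missing sites."""
    pairs = set()
    for combo in itertools.product(*(site_options(x) for x in g)):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        # Store each unordered pair once, in sorted order.
        pairs.add((h1, h2) if h1 <= h2 else (h2, h1))
    return sorted(pairs)


print(len(explaining_pairs((0, 2))))        # -> 1
print(len(explaining_pairs((0, MISSING))))  # -> 3
```

A fully typed genotype with one heterozygous site has a single explanation, while replacing that site with missing data yields three candidate pairs, so a local search over explanations has a larger candidate set to choose from.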

Trios
• A family trio: two parents and a child
• One haplotype from mother, one from father
• At each step we re-explain a whole family

Extensions