recomb05 poster ent phasing

Poster Session A, Bay 30

Haplotype Inference by Entropy Minimization

Ion Mandoiu and Bogdan Pasaniuc, CSE Department, University of Connecticut

Introduction

• A Single Nucleotide Polymorphism (SNP) is a position in the genome at which exactly two of the possible four nucleotides occur in a large percentage of the population. SNPs account for most of the genetic variability between individuals, and mapping SNPs in human populations has become the next high priority in genomics after the completion of the Human Genome Project.

• In diploid organisms such as humans, there are two non-identical copies of each chromosome. A description of the SNPs in each chromosome is called a haplotype, which can be viewed as a 0/1 vector, e.g., by representing the most frequent (dominant) SNP allele as a 0 and the alternate (minor) allele as a 1.

Example: the genotype gcc{AT}ac{TG} is explained either by the haplotype pair (gccTacG, gccAacT) or by the pair (gccTacT, gccAacG).

• At present, it is prohibitively expensive to directly determine the haplotypes of an individual, but it is relatively easy to obtain the conflated SNP information in the so-called genotype. A genotype can be conveniently represented as a 0/1/2 vector, where 0 (1) means that both chromosomes contain the dominant (respectively minor) allele, and 2 means that the two chromosomes contain different alleles.
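To make the 0/1/2 encoding concrete, here is a minimal Python sketch that conflates two 0/1 haplotypes into a genotype. The function name `genotype_of` is illustrative and does not come from the poster.

```python
def genotype_of(h1, h2):
    """Conflate two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (0 or 1); different alleles become 2.
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# Two different haplotype pairs can yield the same genotype,
# which is why phasing is ambiguous.
pair_a = ((0, 0), (1, 1))
pair_b = ((0, 1), (1, 0))
same = genotype_of(*pair_a) == genotype_of(*pair_b)
```

Here `same` is true: both pairs conflate to the genotype (2, 2), so the genotype alone cannot tell them apart.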

Problem Definition

A pair of haplotypes (h, h') explains g if h(i) = h'(i) = g(i) whenever g(i) is 0 or 1, and h(i) ≠ h'(i) whenever g(i) = 2.

A phasing of a set of genotypes G ⊆ {0,1,2}^k is a function f: G → {0,1}^k × {0,1}^k such that, for every g in G, f(g) is a pair of haplotypes that explains g.

Entropy of a phasing

\[ \mathrm{Entropy}(f) = -\sum_{h:\ \mathrm{cov}(h,f) \neq 0} \frac{\mathrm{cov}(h,f)}{2|G|} \log \frac{\mathrm{cov}(h,f)}{2|G|} \]

where cov(h, f) is the number of genotypes g from G such that f(g) = (h, h') or f(g) = (h', h) for some h' ≠ h, plus twice the number of genotypes g such that f(g) = (h, h).

Minimum Entropy Population Phasing: Given a set of genotypes, find a phasing with minimum entropy.
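The definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the names `explaining_pairs`, `coverage`, and `entropy` are hypothetical, a phasing is stored as a list with one haplotype pair per genotype, and the logarithm is taken base 2 (the poster does not state the base).

```python
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """List every unordered haplotype pair (h, h2) that explains genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]  # ambiguous sites
    if not het:
        return [(tuple(g), tuple(g))]
    pairs = []
    # Pin h to 0 at the first ambiguous site so (h, h2) and (h2, h)
    # are not both listed.
    for bits in product((0, 1), repeat=len(het) - 1):
        h, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h[i], h2[i] = b, 1 - b
        pairs.append((tuple(h), tuple(h2)))
    return pairs


def coverage(phasing):
    """cov(h, f): how many of the 2|G| haplotype slots are filled by h."""
    cov = Counter()
    for h, h2 in phasing:
        cov[h] += 1
        cov[h2] += 1  # a pair (h, h) therefore counts twice for h
    return cov


def entropy(phasing):
    """Entropy of a phasing given as a list of haplotype pairs (base-2 log assumed)."""
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in coverage(phasing).values())
```

A genotype with m ambiguous sites has 2^(m-1) explaining pairs, so exhaustive enumeration like this is only practical for short genotypes.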

Previous Approaches

Approaches to Phasing

• Maximum Likelihood
  - PHASE [Stephens et al. 01] - repeatedly chooses a genotype at random, and estimates that individual's haplotypes under the assumption that all other haplotypes are correctly reconstructed
  - GERBIL [Kimmel&Shamir 05] - expectation maximization for genotype resolution and block partitioning
• Perfect Phylogeny
  - Set of haplotypes used in the phasing must be consistent with a perfect phylogeny [Gusfield 02]
• Pure Parsimony
  - Minimizing the number of distinct haplotypes
  - Integer Linear Program formulations: exponential size [Gusfield 04], polynomial size [Brown&Harrower 05]
• Entropy
  - Minimize the entropy of the phasing
  - [Halperin&Karp 04] - simple greedy approximation algorithm

Local optimization algorithm for entropy minimization

1. Create a random phasing f
2. Repeat forever:
   - Find the pair (g, (h,h')) that minimizes entropy(f'), where f' is obtained from f by re-explaining g with (h,h')
   - If entropy(f') < entropy(f), update f (change the current explanation for g to (h,h'))
   - Else exit loop
3. Output the phasing f
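The steps above can be sketched in Python as a best-improvement local search. This is a minimal illustration, not the authors' implementation: the names `explaining_pairs`, `entropy`, and `local_search` are hypothetical, the base-2 logarithm is an assumption, and the entropy is naively recomputed from scratch for every candidate move.

```python
import random
from collections import Counter
from itertools import product
from math import log2


def explaining_pairs(g):
    """All unordered haplotype pairs that explain the 0/1/2 genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):  # first ambiguous site pinned to 0
            h[i], h2[i] = b, 1 - b
        pairs.append((tuple(h), tuple(h2)))
    return pairs


def entropy(phasing):
    """Entropy of a phasing stored as a list of haplotype pairs."""
    cov = Counter(h for pair in phasing for h in pair)
    total = 2 * len(phasing)
    return -sum(c / total * log2(c / total) for c in cov.values())


def local_search(genotypes, seed=0):
    rng = random.Random(seed)
    options = [explaining_pairs(g) for g in genotypes]
    phasing = [rng.choice(opts) for opts in options]  # step 1: random phasing
    current = entropy(phasing)
    while True:  # step 2
        best, best_move = current, None
        for i, opts in enumerate(options):
            old = phasing[i]
            for pair in opts:
                if pair == old:
                    continue
                phasing[i] = pair  # tentatively re-explain genotype i
                e = entropy(phasing)
                if e < best - 1e-12:  # tolerance guards against float noise
                    best, best_move = e, (i, pair)
            phasing[i] = old
        if best_move is None:
            return phasing  # step 3: no single re-explanation lowers the entropy
        i, pair = best_move
        phasing[i] = pair
        current = best


result = local_search([(0, 0), (1, 1), (2, 2)])
```

In this small example the ambiguous genotype (2, 2) ends up explained by (0, 0) and (1, 1), because both haplotypes already occur in the population and reusing them gives the lower entropy. The loop terminates because the entropy strictly decreases with every accepted move and there are finitely many phasings.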

Phasing Short Genotypes

Switching Error (%)

                                        ENTROPY_PHASE
Dataset             #Gen  RAND   k=1   k=3   k=5   k=7   k=9  GERBIL  PHASE
Daly                 129  43.5  15.9   5.1   3.7   4.2   4.6     2.7    2.6
Gabriel               29  35.4  25.1  17.3  17.9  17.6  18.4    11.1    9.8
5q31-wafr (89 snp)    50  47.5  25.8  15.7  16.1  15.6  17.2      12    2.8
                     100  48.2  24.9  12.9    12    11   9.1     9.9    0.6
                     200  48.1  24.7  12.1   9.2   6.6   5.3    11.5    0.1
                     400  48.5  24.7  11.7   8.1   5.2   3.6    11.5      0
                     600  48.5  24.6  11.2   7.2   4.3   2.7    11.4      0
                     800  48.5  24.8  10.8   7.5   4.0   3.0    11.3     --
5q31-euro (99 snp)    50  47.7  16.3   6.8   5.9     6   6.5     4.3    1.7
                     100  47.8  15.8   6.5   4.3   4.3   4.7     4.4    0.6
                     200  48.2  15.8   6.1     4   3.1   3.1     4.5    0.2
                     400  48.9  15.8   5.6   3.3   2.6   2.2     4.1      0
                     600    49  15.9   5.6   3.2   2.3     2     4.1      0
                     800  48.8  16.1   5.5   3.0   2.3   1.7     4.1     --

Experimental Setup

Datasets
• [Daly 2001] 129 family trios over a region of 103 SNPs
• [Gabriel 2002] 60 blocks with an average of 50 SNPs genotyped for 29 individuals
• [Forton et al 2004] Simulated populations generated as follows
  - 32 European and 32 West African family trios were genotyped at the IL8 and 5q31 regions [Hull et al. 2000]
  - Population haplotypes and their frequencies were inferred using Phamily and PHASE
  - Based on these haplotype frequencies, 100,000 random genotypes were generated, from which we selected populations of size between 50 and 800

[Figure: 5q31-wafr switch error for population sizes 50, 100, 200, 400, 600, and 800; series: win=1, win=3, win=5, win=7, win=9, GERBIL, PHASE]

Switch error rate

Given the true haplotypes (t,t') and the inferred ones (h,h'), the switch error rate is the number of times we have to switch from reading h to h' to obtain t, divided by the number of ambiguous positions.
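The switch error rate as defined above can be computed with a short Python sketch. The function name `switch_error_rate` is illustrative, and the code assumes the inferred pair explains the same genotype as the true pair, so that h and h' differ exactly at the ambiguous positions.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Switches needed to read t out of (h, h2), divided by the ambiguous positions."""
    t, t2 = true_pair
    h, _h2 = inferred_pair
    ambiguous = [i for i in range(len(t)) if t[i] != t2[i]]
    if not ambiguous:
        return 0.0  # nothing to phase, so no switch is possible
    switches = 0
    reading_h = None  # which inferred haplotype currently matches t
    for i in ambiguous:
        matches_h = h[i] == t[i]
        if reading_h is not None and matches_h != reading_h:
            switches += 1  # we had to jump to the other inferred haplotype
        reading_h = matches_h
    return switches / len(ambiguous)


truth = ((0, 0, 0, 0), (1, 1, 1, 1))
one_switch = switch_error_rate(truth, ((0, 0, 1, 1), (1, 1, 0, 0)))  # 0.25
```

The order of the two haplotypes within a pair does not matter: the first ambiguous position only fixes which inferred haplotype is being read and is never counted as a switch.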

IL8-datasets

                                        ENTROPY_PHASE
Dataset             #Gen  RAND   k=1   k=3   k=5   k=7   k=9  GERBIL  PHASE
IL8-euro (55 snp)     50  44.9   4.9   3.7   3.1   2.8     3     2.9    1.7
                     100  46.2     5   3.8   3.2   2.7     3     2.8    1.2
                     200    47   4.7   3.1   2.8   2.2   2.3     2.6    0.5
                     400  47.6   4.8   3.2   2.7   2.3   2.2     2.7    0.4
                     600  47.6   4.7   3.1   2.7   2.1     2     2.6    0.4
                     800  47.8   4.8     3   2.7   2.1   2.1     2.6     --
IL8-wafr (52 snp)     50  45.1  17.2  12.5  11.6  11.6  10.9     9.6    4.9
                     100  45.9  16.5  12.2    11   9.7  10.1     9.7    2.5
                     200  46.5  16.6  11.4   9.5   8.2   7.6     8.8    0.7
                     400  46.9  16.5  10.9     9   7.5   6.4     8.7    0.2
                     600    47  16.2  10.9   8.8   6.9   5.6     8.5    0.1
                     800  47.1  16.3  10.7   8.3   6.6   5.2     8.4     --

Conclusions

• Entropy minimization gives a unified framework for various phasing problem variants, including phasing genotypes with missing data and pedigree-constrained phasing

• Preliminary results show that entropy minimization is competitive with existing methods in haplotype reconstruction accuracy, particularly for large populations

• Currently, we are implementing trio-based entropy phasing and are exploring other strategies for phasing long genotypes

References

• V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: a direct approach. Technical Report, UC Davis, 2002.
• D. Brown and I. Harrower. A new integer programming formulation for the pure parsimony problem in haplotype analysis. Proceedings of WABI 2004, 254-265.
• Daly et al. High resolution haplotype structure in the human genome. Nature Genetics 29:229-232, 2001.
• Gabriel et al. The structure of haplotype blocks in the human genome. Science 296:2225-2229, 2002.
• E. Halperin and R. Karp. The Minimum-Entropy Set Cover Problem. International Colloquium on Automata, Languages and Programming, 2004.
• J. Hull et al. Association of respiratory syncytial virus bronchiolitis with the interleukin 8 gene region in UK families. Thorax 55:1023-1027, 2000.
• M. Stephens, N. Smith, and P. Donnelly. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics 68:978-989, 2001.

Extensions

Long Genotypes
• Divide the genotypes into windows of size k
• Run the previous algorithm on windows of size 2*k, keeping the first k SNPs fixed

Handling Missing Data
• Any value is correct for a SNP with missing data; the result is more pairs of haplotypes that can explain a genotype
• The local improvement algorithm remains the same

Trios
• A family trio: two parents and a child
• One haplotype from mother, one from father
• At each step we re-explain a whole family
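For the missing-data extension, the only change is the set of haplotype pairs that may explain a genotype. The Python sketch below illustrates this under assumptions that are not from the poster: a missing SNP is marked with `None`, and the names `site_options` and `explaining_pairs_with_missing` are hypothetical.

```python
from itertools import product

MISSING = None  # hypothetical marker for a SNP with missing data


def site_options(x):
    """Allele pairs (a, b) allowed at one site of the genotype."""
    if x in (0, 1):
        return [(x, x)]            # homozygous: both haplotypes carry x
    if x == 2:
        return [(0, 1), (1, 0)]    # heterozygous: the alleles differ
    return [(0, 0), (1, 1), (0, 1), (1, 0)]  # missing: any value is allowed


def explaining_pairs_with_missing(g):
    """All unordered haplotype pairs that explain a genotype with missing sites."""
    pairs = set()
    for choice in product(*(site_options(x) for x in g)):
        h = tuple(a for a, _ in choice)
        h2 = tuple(b for _, b in choice)
        pairs.add(tuple(sorted((h, h2))))  # store (h, h2) and (h2, h) once
    return sorted(pairs)


n_known = len(explaining_pairs_with_missing((2, 0)))          # 1
n_missing = len(explaining_pairs_with_missing((2, MISSING)))  # 4
```

Replacing a known site with a missing one raises the number of candidate explanations from 1 to 4 in this example; a local improvement loop can then search over the larger candidate set without other changes.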
