SNPs, Haplotypes, Disease Associations

Post on 24-Jan-2016


SNPs, Haplotypes, Disease Associations. Algorithmic Foundations of Computational Biology II, Course 1. Prof. Sorin Istrail. SNPs and the Human Genome: The Minimal Informative Subset. Overview. Introduction: SNPs, Haplotypes. A Data Compression Problem:

Transcript of SNPs, Haplotypes, Disease Associations

SNPs, Haplotypes, Disease Associations

Algorithmic Foundations of Computational Biology II

Course 1

Prof. Sorin Istrail

SNPs and the Human Genome: The Minimal Informative Subset

Overview

Introduction: SNPs, Haplotypes

A Data Compression Problem: The Minimum Informative Subset

A New Measure: Informativeness

A Most Challenging Problem

“None of the [advances of the 20th century medicine] depend on a deep knowledge of cellular processes or on any discoveries of molecular biology.

Cancer is still treated by gross physical and chemical assaults on the offending tissue.

Cardiovascular Disease is treated by surgery whose anatomical bases go back to the 19th century … Of course, intimate knowledge of the living cell and of basic molecular processes may be useful eventually.”

Lewontin (1991)

Now

“A decade later, molecular biology can claim very few successes for drugs in clinical use that were designed ab initio to control a specific component of a pathway linked to disease: these include the monoclonal antibody Herceptin, and the kinase inhibitor Gleevec.”

Reik, Gregory and Urnov (2002)

Introduction

SNPs, HAPLOTYPES

A SNP is a position in a genome at which two or more different bases occur in the population, each with a frequency >1%.

GATTTAGATCGCGATAGAG
GATTTAGATCTCGATAGAG

The most abundant type of polymorphism

The two alleles at the site are G and T

Single Nucleotide Polymorphism (SNP)

tttctccatttgtcgtgacacctttgttgacaccttcatttctgcattctcaattctatttcactggtctatggcagagaacacaaaatatggccagtggcctaaatccagcctactaccttttttttttttttgtaacattttactaacatagccattcccatgtgtttccatgtgtctgggctgcttttgcactctaatggcagagttaagaaattgtagcagagaccacaatgcctcaaatatttactctacagccctttataaaaacagtgtgccaactcctgatttatgaacttatcattatgtcaataccatactgtctttattactgtagttttataagtcatgacatcagataatgtaaatcctccaactttgtttttaatcaaaagtgttttggccatcctagatatactttgtattgccacataaatttgaagatcagcctgtcagtgtctacaaaatagcatgctaggattttgatagggattgtgtagaatctatagattaattagaggagaatgactatcttgacaatactgctgcccctctgtattcgtgggggattggttccacaacaacacccaccccccactcggcaacccctgaaacccccacatcccccagcttttttcccctgctaccaaaatccatggatgctcaagtccatataaaatgccatactatttgcatataacctctgcaatcctcccctatagtttagatcatctctagattacttataatactaataaaatctaaatgctatgtaaatagttgctatactgtgttgagggttttttgttttgttttgttttatttgtttgtttgtttgtattttaagagatggtgtcttgctttgttgcccaggctggagtgcagtggtgagatcatagcttactgcagcctcaaactcctggactcaaacagtcctcccacctcagcctcccaaagtgctgggatacaggtgtgacccactgtgcccagttattattttttatttgtattattttactgttgtattatttttaattattttttctgaatattttccatctatagttggttgaatcatggatgtggaacaggcaaatatggagggctaactgtattgcatcttccagttcatgagtatgcagtctctctgtttatttaaagttttagtttttctcaaccatgtttacttttcagtatacaagactttgacgttttttgttaaatgtatttgtaagtattttattatttgtgatgttatttaaaaagaaattgttgactgggcacagtggctcacgcctgtaatcccagcactttgggaggctgaggcgggcagatcacgaggtcaggagatcaagaccatcctggctaacatggtaaaaccccgtctctactaaaaatagaaaaaaattagccaggcgtggtggcgagtgcctgtagtcccagctactcgggaggctgaggcaggagaatggtgtgaacctgggaggcggagcttgcagtgagctgagatcgtgccactgcattccagcctgcgtgacagagcgagactctgtcaaaaaaataaataaaatttaaaaaaagaagaagaaattattttcttaatttcattttcaggttttttatttatttctactatatggatacatgattgatttttgtatattgatcatgtatcctgcaaactagctaacatagtttattatttctctttttttgtggattttaaaggattttctacatagataaataaacacacataaacagttttacttctttcttttcaacctagactggatgcattttttgtttttgtttgtttgtttgctttttaacttgctgcagtgactagagaatgtattgaagaatatattgttgaacaaaagcagtgagagtggacatccctgctttccccctgattttagggggaatgttttcagtctttcactatttaatatgattttagctataggtttatcctagatccctgttatcatgttgag
gaaattcccttctatttctagtttgttgagattttttaattcatgtgattgcgctatctggctttgctctca

[Figure: the sequence above annotated with two-letter labels (tc, ga, gc) at marked positions]

The human genome contains ~3 G base pairs (haploid), carried in 23 chromosome pairs (46 chromosomes per diploid cell).

Two individuals are 99.9% the same, i.e., they differ in ~3 M base pairs.

SNPs occur once every ~600 bp.

The average gene in the human genome spans ~27 kb.

~50 SNPs per gene
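The figures on this slide follow from one another by simple division. A quick check, using only the rounded values quoted above (the variable names are illustrative, not from the slides):

```python
# Back-of-the-envelope check of the slide's rounded figures.
GENOME_BP = 3_000_000_000   # ~3 G base pairs
DIFF_PER_1000_BP = 1        # 99.9% identical -> 1 difference per 1000 bp
SNP_SPACING_BP = 600        # one SNP every ~600 bp
GENE_SPAN_BP = 27_000       # average gene spans ~27 kb

differing_bp = GENOME_BP * DIFF_PER_1000_BP // 1000
snps_genome_wide = GENOME_BP // SNP_SPACING_BP
snps_per_gene = GENE_SPAN_BP // SNP_SPACING_BP

print(differing_bp)      # 3000000
print(snps_genome_wide)  # 5000000
print(snps_per_gene)     # 45
```

The per-gene result is 45, which the slide rounds to ~50.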

G C T C G A C A A C A G
G T T C G T C A A C A G

Two individuals

Haplotypes:
C A G
T T G

SNP SNP SNP

Haplotype

Mutations

Infinite Sites Assumption:

Each site mutates at most once

Haplotype Pattern

0 0 0 0
1 1 0 1
0 0 1 0
0 1 0 1

C A G T
T T G A
C A T G
C T G T

At each SNP site label the two alleles as 0 and 1.

The choice of which allele is 0 and which is 1 is arbitrary.
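The 0/1 labelling can be sketched in a few lines. This is a minimal illustration, not the course's code: the function name `encode_haplotypes` and the convention "alphabetically smaller allele becomes 0" are assumptions, which is acceptable because the choice is arbitrary. The input is the first three SNPs of the slide's four haplotypes; the fourth column is left out because, as transcribed, it shows three different letters.

```python
def encode_haplotypes(haplotypes):
    """Recode each biallelic SNP column as 0/1.

    Convention (an arbitrary choice): the alphabetically smaller
    allele in a column becomes 0, the other becomes 1.
    """
    n_sites = len(haplotypes[0])
    columns = []
    for j in range(n_sites):
        alleles = sorted({h[j] for h in haplotypes})
        if len(alleles) > 2:
            raise ValueError(f"site {j} has more than two alleles: {alleles}")
        columns.append({a: i for i, a in enumerate(alleles)})
    return [[columns[j][h[j]] for j in range(n_sites)] for h in haplotypes]

haps = ["CAG", "TTG", "CAT", "CTG"]
for row in encode_haplotypes(haps):
    print(*row)
# 0 0 0
# 1 1 0
# 0 0 1
# 0 1 0
```

Swapping 0 and 1 in any column gives an equally valid encoding; only the partition of haplotypes at each site matters.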

G T T C G A C T A T T A

G T T C G A C A A C A T
A C G T A T C T A T T A

Recombination

The two alleles are linked, i.e., they are “traveling together”.

Recombination disrupts the linkage.

Recombination
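A single crossover can be modelled as taking a prefix from one parental haplotype and the suffix from the other. The sketch below assumes that model; the function name `recombine` and the breakpoint value are illustrative. With the two twelve-site sequences from the slide, a crossover after the seventh site reproduces the slide's third sequence.

```python
def recombine(parent_a, parent_b, breakpoint):
    """Single crossover: sites before `breakpoint` come from parent_a,
    the remaining sites from parent_b."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    return parent_a[:breakpoint] + parent_b[breakpoint:]

a = "GTTCGACAACAT"
b = "ACGTATCTATTA"
child = recombine(a, b, 7)
print(child)  # GTTCGACTATTA
```

Alleles on the same side of the breakpoint stay together in the child, while alleles on opposite sides are separated, which is how recombination breaks linkage.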

Variations in Chromosomes Within a Population

Common Ancestor

Emergence of Variations Over Time

time present

Disease Mutation

Linkage Disequilibrium (LD)

Time = present

2,000 gens. ago

Disease-Causing Mutation

1,000 gens. ago

Extent of Linkage Disequilibrium

A Data Compression Problem

The Minimum Informative Subset

A Data Compression Problem

Select SNPs to use in an association study:
Would like to associate single nucleotide polymorphisms (SNPs) with disease.

Very large number of candidate SNPs:
Chromosome-wide studies, whole-genome scans.
For cost effectiveness, select only a subset.

Closely spaced SNPs are highly correlated:
It is less likely that there has been a recombination between two SNPs if they are close to each other.

Disease Associations

Association studies

Disease / Responder

Control / Non-responder

Marker A: Allele 0, Allele 1

Marker A is associated with phenotype.

Evaluate whether nucleotide polymorphisms associate with phenotype.
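The slides do not say which statistic is used to decide that a marker is associated with the phenotype. One common choice, assumed here purely for illustration, is a Pearson chi-square test on the 2x2 table of allele counts in cases and controls. The counts below are invented.

```python
def chi_square_2x2(case_0, case_1, control_0, control_1):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for a 2x2 table of allele counts."""
    n = case_0 + case_1 + control_0 + control_1
    row_case, row_control = case_0 + case_1, control_0 + control_1
    col_0, col_1 = case_0 + control_0, case_1 + control_1
    denominator = row_case * row_control * col_0 * col_1
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    return n * (case_0 * control_1 - case_1 * control_0) ** 2 / denominator

# Hypothetical counts of marker A alleles, invented for illustration.
stat = chi_square_2x2(case_0=10, case_1=30, control_0=30, control_1=10)
print(stat)         # 20.0
print(stat > 3.84)  # True: above the 5% critical value for 1 d.o.f.
```

In a real study the test would be repeated for many markers, so the threshold would need a multiple-testing correction.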

T A G A A
C G G A A
C G T A A
T A T C G
T G T A G
T G G A G


Association studies

1 1 0 0 0
0 0 0 0 0
0 0 1 0 0
1 1 1 1 1
1 0 1 0 1
1 0 0 0 1

Association studies

Compression based on Haplotype Resolution

0 1 01 1

1 0 00

0 0 10 1

1

With a SNP s we associate a bipartite graph.

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at s.

s1

s2

D-graph of a SNP
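The definition translates directly into code. The sketch below is an illustration under assumptions: haplotypes are rows of a 0/1 matrix, an edge is a pair of row indices, and the name `d_graph` and the example matrix are invented.

```python
from itertools import combinations

def d_graph(haplotypes, s):
    """Edges of the D-graph of SNP `s`: pairs of haplotype indices
    whose alleles differ at column `s`."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][s] != haplotypes[j][s]}

# Hypothetical haplotypes (rows) over three SNPs (columns).
haps = [(0, 0, 1),
        (0, 1, 1),
        (1, 1, 0),
        (1, 0, 1)]

print(sorted(d_graph(haps, 0)))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

Every edge joins a haplotype carrying allele 0 to one carrying allele 1 at s, which is why the graph of a single SNP is bipartite.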


With a set of SNPs S we associate a graph, the union of the D-graphs of its SNPs.

Nodes: the set of haplotypes.

Edges: the set of pairs of haplotypes with different alleles at some SNP s in S.

s1

s2

D-graph of a set of SNPs
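For a set of SNPs the edge test becomes "differs at one or more SNPs of the set". A minimal sketch, with an invented function name and toy data:

```python
from itertools import combinations

def d_graph_of_set(haplotypes, snps):
    """Edges of the D-graph of a SNP set: pairs of haplotype indices
    that differ at one or more of the columns in `snps`."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)}

haps = [(0, 0), (0, 1), (1, 0)]  # hypothetical haplotypes over two SNPs

print(sorted(d_graph_of_set(haps, [0])))     # [(0, 2), (1, 2)]
print(sorted(d_graph_of_set(haps, [0, 1])))  # [(0, 1), (0, 2), (1, 2)]
```

The edge set can only grow as SNPs are added. In this toy case the two-SNP graph is a triangle, so the union of several single-SNP D-graphs need not stay bipartite.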


Red SNP is equivalent to Blue SNP

SNP Selection

Red SNPs predict Green SNP


SNP Selection

Minimal Informative Subset


Data Compression

Compression based on Haplotype Blocks

Hypothesis – Haplotype Blocks?

The genome consists largely of blocks of common SNPs with relatively little recombination within the blocks.

Patil et al., Science, 2001; Jeffreys et al., Nature Genetics, 2001; Daly et al., Nature Genetics, 2001

[Figure: a 200 kb region with tracks for sense genes, antisense genes, DNA, SNPs, and haplotype blocks numbered 1 to 4]

Haplotype Block Structure: LD-Blocks and 4-Gamete Test Blocks

Hudson and Kaplan 1985

A segment of SNPs is a block if between every pair of SNPs at most 3 out of the 4 gametes (00, 01,10,11) are observed.

0 0 1
0 1 1
1 1 0
1 1 1
BLOCK

0 0 1
0 1 1
1 1 0
1 0 1
VIOLATES THE BLOCK DEFINITION

Four Gamete Block Test
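The test can be checked mechanically on the two small matrices from the slide. The sketch below assumes haplotypes are given as strings of 0s and 1s; the function name `is_four_gamete_block` is illustrative.

```python
from itertools import combinations

def is_four_gamete_block(haplotypes):
    """True if no pair of SNP columns shows all four gametes 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    for a, b in combinations(range(n_sites), 2):
        gametes = {(h[a], h[b]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True

block = ["001", "011", "110", "111"]
not_block = ["001", "011", "110", "101"]

print(is_four_gamete_block(block))      # True
print(is_four_gamete_block(not_block))  # False
```

In the second matrix the first two columns show 00, 01, 11 and 10, which is what violates the block definition.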

Finding Recombination Hotspots: Many Possible Partitions into Blocks

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

All four gametes are present:

A C T A G A T A G C C T
G T T C G A C A A C A T
A C T C T A T G A T C G
G T T A T A C G A C A T
A C T C T A T A G T A T
A C T A G C T G G C A T

Find the left-most right endpoint of any constraint and mark the site before it as a recombination site.

Eliminate any constraints crossing that site.

Repeat until all constraints are gone.

The final result is a minimum-size set of sites crossing all constraints.
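The greedy procedure above can be sketched as a single pass over the constraints sorted by right endpoint, which visits the same "left-most right endpoint" at each step. The representation is an assumption made for this sketch: a constraint is a pair (i, j) of SNP indices with i < j that shows all four gametes, and a returned value b denotes the boundary between sites b - 1 and b. The function name and the example constraints are invented.

```python
def min_recombination_sites(constraints):
    """Greedy stabbing of constraints.

    A boundary b (between sites b - 1 and b) crosses constraint (i, j)
    when i < b <= j.
    """
    sites = []
    for i, j in sorted(constraints, key=lambda c: c[1]):
        # Earlier boundaries are all <= j, so (i, j) is still uncrossed
        # exactly when the last boundary does not lie to the right of i.
        if not sites or sites[-1] <= i:
            sites.append(j)  # mark the boundary just before site j
    return sites

# Hypothetical constraints, invented for illustration.
print(min_recombination_sites([(0, 3), (1, 2), (2, 5), (4, 5)]))  # [2, 5]
```

Here the boundary before site 2 crosses (1, 2) and (0, 3), and the boundary before site 5 crosses (2, 5) and (4, 5), so two recombination sites suffice.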

Data Compression

ACGATCGATCATGAT

GGTGATTGCATCGAT

ACGATCGGGCTTCCG

ACGATCGGCATCCCG

GGTGATTATCATGAT

A------A---TG--

G------G---CG--

A------G---TC--

A------G---CC--

G------A---TG--
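The compression on this slide keeps four of the fifteen sites. A short check, using the slide's five haplotypes and the four kept positions (0-based 0, 7, 11, 12), shows that these sites are enough to tell every haplotype in the sample apart. The helper names `tag_key` and `mask` are illustrative.

```python
haplotypes = [
    "ACGATCGATCATGAT",
    "GGTGATTGCATCGAT",
    "ACGATCGGGCTTCCG",
    "ACGATCGGCATCCCG",
    "GGTGATTATCATGAT",
]
tag_sites = [0, 7, 11, 12]  # 0-based positions of the letters kept on the slide

def tag_key(hap):
    """Alleles of one haplotype at the tag sites."""
    return "".join(hap[i] for i in tag_sites)

def mask(hap):
    """Keep the tag-site letters and replace everything else by '-'."""
    return "".join(c if i in tag_sites else "-" for i, c in enumerate(hap))

for hap in haplotypes:
    print(mask(hap))

# The tags are lossless for this sample if every haplotype has a distinct key.
lookup = {tag_key(h): h for h in haplotypes}
print(len(lookup) == len(haplotypes))  # True
print(lookup["AGTC"])                  # ACGATCGGGCTTCCG
```

The lookup only recovers haplotypes already seen in the sample; a new haplotype with an unseen tag combination could not be reconstructed this way.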

Haplotype Blocks based on LD (Method of Gabriel et al. 2002)

Selecting Tagging SNPs in blocks

A New Measure

Informativeness

Informativeness

0 1 0 0 1
0 1 1 0 0

s

h2

h1

1 0 0 0 0
0 1 0 0 1
0 1 1 0 0
1 0 1 1 1

s1 s2 s3 s4 s5

I(s1,s2) = 2/4 = 1/2

Informativeness


I({s1,s2}, s4) = 3/4
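The slides give only worked values for informativeness. The sketch below assumes the following reading: I(S, t) is the fraction of edges in the D-graph of the target SNP t that are also edges of the D-graph of the set S. The 4 x 5 matrix is hypothetical, chosen so that the computed values agree with the ones quoted on the slides; it is not claimed to be the slide's matrix. The target SNP is assumed to be polymorphic, otherwise it has no edges.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical haplotype matrix: rows h1..h4, columns s1..s5.
H = [(0, 1, 0, 0, 0),
     (0, 0, 0, 1, 1),
     (0, 0, 1, 0, 1),
     (1, 1, 1, 1, 0)]

def edges(snps):
    """Haplotype pairs distinguished by one or more SNPs in `snps`."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if any(H[i][s] != H[j][s] for s in snps)}

def informativeness(S, t):
    """Fraction of the edges of target SNP t that SNP set S also covers."""
    target = edges([t])
    return Fraction(len(target & edges(S)), len(target))

s1, s2, s3, s4, s5 = range(5)
print(informativeness([s1], s2))      # 1/2
print(informativeness([s1, s2], s4))  # 3/4
```

A value of 1 means that whenever two haplotypes differ at t, they also differ somewhere in S, so S predicts t completely on this sample.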

Informativeness


I({s3,s4},{s1,s2,s5}) = 3

S = {s3, s4} is a Minimal Informative Subset
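For a handful of SNPs, a smallest fully informative subset can be found by trying all subsets in order of size. The sketch assumes that I(S, T) for a set of targets T is the sum of the per-target fractions, which is consistent with the value 3 for three targets above. The matrix is hypothetical, chosen to agree with the quoted values, and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical haplotype matrix: rows h1..h4, columns s1..s5.
H = [(0, 1, 0, 0, 0),
     (0, 0, 0, 1, 1),
     (0, 0, 1, 0, 1),
     (1, 1, 1, 1, 0)]

def edges(snps):
    """Haplotype pairs distinguished by one or more SNPs in `snps`."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if any(H[i][s] != H[j][s] for s in snps)}

def informativeness_of_set(S, T):
    """Sum over targets t in T of the fraction of t's edges covered by S."""
    covered = edges(S)
    return sum(len(edges([t]) & covered) / len(edges([t])) for t in T)

def minimum_informative_subset(n_snps):
    """Smallest SNP set covering every edge that any SNP covers
    (exhaustive search, only feasible for small inputs)."""
    all_edges = edges(range(n_snps))
    for k in range(1, n_snps + 1):
        for S in combinations(range(n_snps), k):
            if edges(S) == all_edges:
                return S

s1, s2, s3, s4, s5 = range(5)
print(informativeness_of_set([s3, s4], [s1, s2, s5]))  # 3.0
print(len(minimum_informative_subset(5)))              # 2
print(edges([s3, s4]) == edges(range(5)))              # True
```

On this toy matrix several two-SNP subsets are fully informative, {s3, s4} among them, and no single SNP is, so the minimum size is 2.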

Informativeness

Minimum Set Cover = Minimum Informative Subset

[Figure: bipartite graph connecting the SNPs (s1, s2, s5, s3, s4) to the edges (e1 to e6), shown next to the haplotype matrix with columns s1 to s5]

Graph theory insight
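In the set cover view, each SNP is a set of haplotype-pair edges and the goal is to cover all edges with as few SNPs as possible. The slides state the equivalence but not a solver; the greedy heuristic below is the standard set cover approximation, swapped in here for illustration. It is not guaranteed to return a minimum cover in general. The matrix is hypothetical and the function names are invented.

```python
from itertools import combinations

# Hypothetical haplotype matrix: rows h1..h4, columns s1..s5.
H = [(0, 1, 0, 0, 0),
     (0, 0, 0, 1, 1),
     (0, 0, 1, 0, 1),
     (1, 1, 1, 1, 0)]

def snp_edges(s):
    """Edges (haplotype pairs) covered by SNP s."""
    return {(i, j) for i, j in combinations(range(len(H)), 2)
            if H[i][s] != H[j][s]}

def greedy_set_cover(n_snps):
    """Repeatedly pick the SNP that covers the most uncovered edges."""
    uncovered = set().union(*(snp_edges(s) for s in range(n_snps)))
    chosen = []
    while uncovered:
        best = max(range(n_snps), key=lambda s: len(snp_edges(s) & uncovered))
        chosen.append(best)
        uncovered -= snp_edges(best)
    return chosen

cover = greedy_set_cover(5)
print(len(cover))  # 2
```

On this small example the greedy cover has two SNPs, the same size as the minimum informative subset, although a different pair than {s3, s4} may be chosen.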

Informativeness

Minimum Set Cover {s3, s4} = Minimum Informative Subset


Real Haplotype Data

Two different runs of the Gabriel et al. block detection method + Zhang et al. SNP selection algorithm

Our block-free algorithm

A region of Chr. 22

45 Caucasian samples

When Maximum Likelihood = Bayesian = Parsimony

[Figure: repeated panels with one axis numbered 1 to 30 and the other 1 to 14 or 1 to 15; legend: A C G T]