
Page 1: The informatics of SNPs and haplotypes Gabor T. Marth Department of Biology, Boston College marth@bc.edu Cold Spring Harbor Laboratory Advanced Bioinformatics.

The informatics of SNPs and haplotypes

Gabor T. Marth

Department of Biology, Boston College

marth@bc.edu

Cold Spring Harbor Laboratory

Advanced Bioinformatics course

October 17, 2005

Page 2:

DNA sequence variations

The Human Genome Project has determined a reference sequence of the human genome

However, every individual is unique, and is different from others at millions of nucleotide locations

sequence polymorphisms

Page 3:

Why do we care about variations?

underlie phenotypic differences

cause inherited diseases

allow tracking ancestral human history

Page 4:

How do we find sequence variations?

• look at multiple sequences from the same genome region

• use base quality values to decide if mismatches are true polymorphisms or sequencing errors

Page 5:

Steps of SNP discovery

Sequence clustering

Cluster refinement

Multiple alignment

SNP detection

Page 6:

Computational SNP mining – PolyBayes

Two innovative ideas:

1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources

2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors

[Figure labels: sequencing error, true polymorphism]

Page 7:

SNP mining steps – PolyBayes

sequence clustering simplifies to database search with genome reference

paralog filtering by counting mismatches weighted by quality values

multiple alignment by anchoring fragments to genome reference

SNP detection by differentiating true polymorphism from sequencing error using quality values
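The paralog-filtering step counts mismatches weighted by quality values. A minimal sketch of one possible weighting is given here in Python. It assumes Phred-scaled base qualities and weights each mismatch by the probability that the base call is correct. The function names and the weighting scheme are illustrative assumptions, not the PolyBayes implementation.

```python
def phred_to_error(q):
    """Convert a Phred-scaled base quality into a base-calling error probability."""
    return 10 ** (-q / 10)


def weighted_mismatches(read, quals, reference):
    """Sum, over mismatching positions, the probability that the read base is correct.

    A high-quality mismatch counts for almost 1; a low-quality one counts for less.
    (Assumed weighting, chosen for illustration only.)
    """
    if not (len(read) == len(quals) == len(reference)):
        raise ValueError("read, quals and reference must have equal length")
    return sum(
        1 - phred_to_error(q)
        for base, q, ref in zip(read, quals, reference)
        if base != ref
    )


# Hypothetical aligned read with two mismatches: one at Q40, one at Q10.
ref = "ACGTACGT"
read = "ACGAACGA"
quals = [40, 40, 40, 40, 40, 40, 40, 10]
score = weighted_mismatches(read, quals, ref)
print(round(score, 4))
```

A read whose weighted mismatch count is much higher than expected for an allelic copy could then be flagged as a likely paralog; the cutoff is a separate modeling choice.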

Page 8:

SNP discovery with PolyBayes

genome reference sequence

1. Fragment recruitment (database search)

2. Anchored alignment

3. Paralog identification

4. SNP detection

Page 9:

Polymorphism discovery SW

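$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \; P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle\sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \; P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$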

Marth et al. Nature Genetics 1999
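The Bayesian SNP probability on this slide can be illustrated by brute-force enumeration for one alignment column. The sketch below is in Python and reads S_i as the true base in sequence i and R_i as the observed read data, which is an interpretation rather than something the slide states. The per-read posterior (Phred error spread evenly over the three other bases), the uniform per-base prior of 1/4 and the joint prior built from a single polymorphism rate `p_snp` are all simplifying assumptions for illustration, not the published parameterization.

```python
from itertools import product

BASES = "ACGT"


def base_posteriors(call, qual):
    """P(S | R) for one read: the called base gets 1 - error, the rest share the error."""
    err = 10 ** (-qual / 10)
    return {b: (1 - err) if b == call else err / 3 for b in BASES}


def joint_prior(config, p_snp):
    """Assumed joint prior: monomorphic columns share 1 - p_snp, variable ones share p_snp."""
    n = len(config)
    if len(set(config)) == 1:
        return (1 - p_snp) / 4
    return p_snp / (4 ** n - 4)


def snp_probability(calls, quals, p_snp=0.001):
    """Posterior probability that the column is variable, by enumerating all 4^N columns."""
    if len(calls) != len(quals) or len(calls) < 2:
        raise ValueError("need at least two reads, with one quality value per call")
    per_read = [base_posteriors(c, q) for c, q in zip(calls, quals)]
    site_prior = 0.25  # assumed uniform P_Prior(S_i)
    total = 0.0
    variable = 0.0
    for config in product(BASES, repeat=len(calls)):
        weight = joint_prior(config, p_snp)
        for post, base in zip(per_read, config):
            weight *= post[base] / site_prior
        total += weight
        if len(set(config)) > 1:
            variable += weight
    return variable / total


print(snp_probability("AAAA", [40, 40, 40, 40]))   # all reads agree
print(snp_probability("AACC", [40, 40, 40, 40]))   # two high-quality alleles
print(snp_probability("AAAC", [40, 40, 40, 5]))    # one low-quality mismatch
```

Enumeration over 4^N columns is only practical for a handful of reads; it is used here because it mirrors the sums in the formula term by term.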

Page 10:

Genotyping by sequence

• SNP discovery usually deals with single-stranded (clonal) sequences

• It is often necessary to determine the allele state of individuals at known polymorphic locations

• Genotyping usually involves double-stranded DNA, so the possibility of heterozygosity exists

• there is no unique underlying nucleotide, no meaningful base quality value, hence statistical methods of SNP discovery do not apply

Page 11:

Het detection = Diploid base calling

Homozygous T

Homozygous C

Heterozygous C/T

Automated detection of heterozygous positions in diploid individual samples

(visit Aaron Quinlan’s poster)

Page 12:

Large SNP mining projects

Sachidanandam et al. Nature 2001

~ 8 million

EST

WGS

BAC

genome reference

Page 13:

Variation structure is heterogeneous

chromosomal averages

polymorphism density along chromosomes

Page 14:

What explains nucleotide diversity?

[Plots: SNP rate [per 10,000 bp] against G+C nucleotide content [%], CpG di-nucleotide content [%], and recombination rate [per Mb]]

functional constraints

3’ UTR           5.00 x 10^-4
5’ UTR           4.95 x 10^-4
Exon, overall    4.20 x 10^-4
Exon, coding     3.77 x 10^-4

synonymous 366 / 653
non-synonymous 287 / 653

Variance is so high that these quantities are poor predictors of nucleotide diversity in local regions; hence random processes are likely to govern the basic shape of the genome variation landscape: (random) genetic drift

Page 15:

Where do variations come from?

• sequence variations are the result of mutation events

• mutations are propagated down through generations

• and determine present-day variation patterns

[Figure: genealogy descending from the MRCA, in which sequences carry either TAAAAAT or TAACAAT]

Page 16:

Neutrality vs. selection

• selective mutations influence the genealogy itself; in the case of neutral mutations the processes of mutation and genealogy are decoupled

functional constraints

3’ UTR           5.00 x 10^-4
5’ UTR           4.95 x 10^-4
Exon, overall    4.20 x 10^-4
Exon, coding     3.77 x 10^-4

synonymous 366 / 653
non-synonymous 287 / 653

• the genome shows signals of selection but on the genome scale, neutral effects dominate

Page 17:

Mutation rate

[Figure: two genealogies from the MRCA. In the first, the sampled sequences accgttatgtaga and accgctatgtaga differ at one site; in the second, actgttatgtaga and accgctatataga differ at three sites]

• higher mutation rate (µ) gives rise to more SNPs

• there is evidence for regional differences in observed mutation rates in the genome

[Plot: SNP density (SNP rate [per 10,000 bp]) against CpG content [%]]

Page 18:

Long-term demography

small (effective) population size N

large (effective) population size N

• different world populations have varying long-term effective population sizes (e.g. African N is larger than European)

Page 19:

Population subdivision

unique unique

shared

• geographically subdivided populations will have differences between their respective variation structures

Page 20:

Recombination

[Figure: chromosomes carrying the sequences acggttatgtaga and accgttatgtaga, which differ at one site]

Page 21:

Recombination

[Figure: chromosomes carrying the sequences acggttatgtaga and accgttatgtaga, which differ at one site]

• recombination has a crucial effect on the association between different alleles

Page 22:

Modeling genetic drift: Genealogy

present generation

randomly mating population, genealogy evolves in a non-deterministic fashion

Page 23:

Modeling genetic drift: Mutation

mutations randomly “drift”: they die out, rise to higher frequency, or become fixed
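Drift of a neutral mutation can be illustrated with a small forward simulation. The Python sketch below assumes a Wright-Fisher population of fixed size with non-overlapping generations, which is a standard textbook model chosen here for illustration and not necessarily the one behind the slide. The function name and parameter values are hypothetical.

```python
import random


def wright_fisher(n_chrom, start_count, rng, max_gen=100_000):
    """Follow a neutral allele's copy number until it is lost (0) or fixed (n_chrom)."""
    count = start_count
    trajectory = [count]
    while 0 < count < n_chrom and len(trajectory) <= max_gen:
        freq = count / n_chrom
        # Each chromosome of the next generation copies a random parent.
        count = sum(rng.random() < freq for _ in range(n_chrom))
        trajectory.append(count)
    return trajectory


rng = random.Random(1)
runs = [wright_fisher(n_chrom=50, start_count=5, rng=rng) for _ in range(1000)]
fixed = sum(t[-1] == 50 for t in runs)
lost = sum(t[-1] == 0 for t in runs)
print(f"fixed: {fixed / len(runs):.3f}, lost: {lost / len(runs):.3f}")
```

Under neutrality the fixation probability equals the starting frequency (here 5/50), so the fraction of fixed runs should come out close to 0.1.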

Page 24:

Modulators: Natural selection

negative (purifying) selection

positive selection

the genealogy is no longer independent of (and hence cannot be decoupled from) the mutation process

Page 25:

Modeling ancestral processes

“forward simulations”

the “Coalescent” process

By focusing on a small sample, complexity of the relevant part of the ancestral process is greatly reduced. There are, however, limitations.
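The saving that comes from following only a small sample can be seen in a minimal coalescent sketch. The Python code below assumes the standard neutral coalescent for a constant-size population, with time measured in units of 2N generations; both the model and the names are assumptions made for illustration.

```python
import random


def coalescent_tmrca(n, rng):
    """Time to the MRCA of n sampled lineages (standard neutral coalescent)."""
    t = 0.0
    k = n
    while k > 1:
        rate = k * (k - 1) / 2       # number of lineage pairs that can coalesce
        t += rng.expovariate(rate)   # waiting time until the next coalescence
        k -= 1
    return t


rng = random.Random(7)
n = 10
reps = 20000
mean_tmrca = sum(coalescent_tmrca(n, rng) for _ in range(reps)) / reps
expected = 2 * (1 - 1 / n)           # known expectation in the same time units
print(f"simulated {mean_tmrca:.3f}, expected {expected:.3f}")
```

Each replicate needs only n - 1 random draws regardless of population size, whereas a forward simulation must track every chromosome in every generation.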

Page 26:

Models of demographic history

[Figure: population size histories drawn from past to present (stationary, expansion, collapse, bottleneck), each shown with its marker density distribution (MD, by simulation) and its allele frequency spectrum (AFS, direct form)]

Page 27:

Data: polymorphism distributions

1. marker density (MD): distribution of number of SNPs in pairs of sequences

Clone 1    Clone 2    # SNPs
AL00675    AL00982    8
AS81034    AK43001    0
CB00341    AL43234    2

[Plot: MD histogram, number of SNPs 0 to 10]

2. allele frequency spectrum (AFS): distribution of SNPs according to allele frequency in a set of samples

SNP    Minor allele    Allele count
A/G    A               1
C/T    T               9
A/G    G               3

[Plot: AFS histogram, allele count 1 to 10, from “rare” to “common”]
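An allele frequency spectrum is a tabulation of SNPs by allele count. The Python sketch below builds one from the three allele counts in the example table (1, 9, 3) and, for comparison, computes the expected folded spectrum under a constant-size neutral model, where the expected number of SNPs with derived-allele count i is proportional to 1/i. The sample size of 20 chromosomes and the function names are assumptions for illustration.

```python
from collections import Counter


def observed_afs(allele_counts, max_count):
    """Fraction of SNPs at each allele count 1..max_count."""
    tally = Counter(allele_counts)
    total = len(allele_counts)
    return [tally.get(i, 0) / total for i in range(1, max_count + 1)]


def expected_folded_afs(n):
    """Expected folded AFS for n chromosomes under a constant-size neutral model.

    Folding merges derived-allele counts i and n - i, because the ancestral
    state is unknown when only the minor allele is recorded.
    """
    weights = []
    for i in range(1, n // 2 + 1):
        w = 1 / i + 1 / (n - i)
        if i == n - i:
            w /= 2   # the middle class would otherwise be counted twice
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]


observed = observed_afs([1, 9, 3], max_count=10)
expected = expected_folded_afs(20)
for count, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(count, round(obs, 3), round(exp, 3))
```

Departures of an observed spectrum from this stationary expectation are what the demographic models on the following slides are fitted to.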

Page 28:

Model: processes that generate SNPs

[Equation not recoverable from this transcript: an expression indexed by k, i and L, with terms numbered 1 to 3]

computable formulations

simulation procedures

3/5 1/5 2/5

Page 29:

Models of demographic history

[Figure repeated from Page 26: the stationary, expansion, collapse and bottleneck histories with their MD (simulation) and AFS (direct form) distributions]

Page 30:

Data fitting: marker density

[Plot: marker density distributions, series labeled 4 kb, 8 kb, 12 kb and 16 kb]

• best model is a bottleneck-shaped population size history

present: N1=6,000, T1=1,200 gen.

N2=5,000, T2=400 gen.

N3=11,000

Marth et al. PNAS 2003

• our conclusions from the marker density data are confounded by the unknown ethnicity of the public genome sequence, so we looked at allele frequency data from ethnically defined samples

Page 31:

Data fitting: allele frequency

[Plots: three allele frequency spectra, allele count 1 to 10]

model consensus: bottleneck

present: N1=20,000, T1=3,000 gen.

N2=2,000, T2=400 gen.

N3=10,000

• Data from other populations?

Page 32:

Population specific demographic history

[Plots: allele frequency spectra against minor allele count (1 to 10)]

European data: bottleneck

African data: modest but uninterrupted expansion

Marth et al. Genetics 2004

Page 33:

Model-based prediction

computational model encapsulating what we know about the process

genealogy + mutations

allele structure

arbitrary number of additional replicates

Page 34:

Prediction – allele frequency and age

[Plots, one pair each for African data and European data: contribution of the past to alleles in various frequency classes (proportion of AFS against mutational size i, 1 to 20), and average age of polymorphism (mutational age in generations, 0 to 80,000, against mutational size i)]

Page 35:

How to use markers to find disease?

Page 36:

Allelic association

• allelic association is the non-random assortment between alleles, i.e. it measures how well knowledge of the allele state at one site permits prediction at another

[Figure labels: marker site, functional site]

• by necessity, the strength of allelic association is measured between markers

• significant allelic association between a marker and a functional site permits localization (mapping) even without having the functional site in our collection

• there are pair-wise and multi-locus measures of association

Page 37:

Linkage disequilibrium

• LD measures the deviation from random assortment of the alleles at a pair of polymorphic sites

$$D = f(AB) - f(A) \times f(B)$$

where A and B are alleles at the two sites and AB is the haplotype carrying both

• other measures of LD are derived from D, by e.g. normalizing according to allele frequencies (r2)
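Pairwise LD can be computed directly from phased two-site haplotypes. The Python sketch below computes D as the haplotype frequency minus the product of the allele frequencies, and r2 as D squared divided by the product of the four allele frequencies. Encoding haplotypes as two-character strings of "0" and "1" is an assumption made for the example.

```python
def pairwise_ld(haplotypes, allele_a="1", allele_b="1"):
    """Return (D, r2) for two-site haplotypes given as strings like '10'."""
    n = len(haplotypes)
    f_a = sum(h[0] == allele_a for h in haplotypes) / n
    f_b = sum(h[1] == allele_b for h in haplotypes) / n
    f_ab = sum(h[0] == allele_a and h[1] == allele_b for h in haplotypes) / n
    d = f_ab - f_a * f_b
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    r2 = d * d / denom if denom > 0 else float("nan")  # undefined if a site is monomorphic
    return d, r2


# Complete association: only two of the four possible haplotypes occur.
print(pairwise_ld(["11", "11", "00", "00"]))
# Random assortment: all four haplotypes at equal frequency.
print(pairwise_ld(["11", "10", "01", "00"]))
```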

Page 38:

Haplotype diversity

• the most useful multi-marker measures of association are related to haplotype diversity

random assortment of alleles at different sites: 2^n possible haplotypes for n markers

strong association: most chromosomes carry one of a few common haplotypes, i.e. reduced haplotype diversity
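The contrast between possible and observed haplotypes is easy to quantify. The Python sketch below counts distinct haplotypes in a sample, compares that with the 2^n possible for n bi-allelic markers, and reports a haplotype diversity value (the chance that two randomly drawn chromosomes differ, with the usual n/(n-1) sample correction). The choice of this particular diversity statistic and the toy data are assumptions for illustration.

```python
from collections import Counter


def haplotype_summary(haplotypes):
    """Summarize a sample of equal-length 0/1 haplotype strings."""
    counts = Counter(haplotypes)
    n_markers = len(haplotypes[0])
    total = len(haplotypes)
    homozygosity = sum((c / total) ** 2 for c in counts.values())
    diversity = (total / (total - 1)) * (1 - homozygosity)
    return {
        "possible": 2 ** n_markers,   # under random assortment of bi-allelic sites
        "observed": len(counts),
        "diversity": diversity,
    }


# Hypothetical sample with strong association: two common haplotypes only.
summary = haplotype_summary(["000"] * 5 + ["111"] * 5)
print(summary)
```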

Page 39:

Haplotype blocks

Daly et al. Nature Genetics 2001

• experimental evidence for reduced haplotype diversity (mainly in European samples)

Page 40:

The promise for medical genetics

CACTACCGACACGACTATTTGGCGTAT

• within blocks a small number of SNPs are sufficient to distinguish the few common haplotypes, so significant marker reduction is possible

• if the block structure is a general feature of human variation structure, whole-genome association studies will be possible at a reduced genotyping cost

• this motivated the HapMap project

Gibbs et al. Nature 2003

Page 41:

The HapMap initiative

• goal: to map out human allele and association structure at the kilobase scale

• deliverables: a set of physical and informational reagents

Page 42:

HapMap physical reagents

• reference samples: 4 world populations, ~100 independent chromosomes from each

• SNPs: computational candidates where both alleles were seen in multiple chromosomes

• genotypes: high-accuracy assays from various platforms; fast public data release

Page 43:

Informational: haplotypes

• the problem: the substrate for genotyping is diploid, genomic DNA; phasing of alleles at multiple loci is in general not possible with certainty

• experimental methods of haplotype determination (single-chromosome isolation followed by whole-genome PCR amplification, radiation hybrids, somatic cell hybrids) are expensive and laborious

[Figure: alleles at several sites of a diploid sample (A, T, C, T, G, C, C, A)]

Page 44:

Haplotype inference

• Parsimony approach: minimize the number of different haplotypes that explain all diploid genotypes in the sample (Clark, Mol Biol Evol 1990)

• Maximum likelihood approach: estimate haplotype frequencies that are most likely to produce observed diploid genotypes (Excoffier & Slatkin, Mol Biol Evol 1995)

• Bayesian methods: estimate haplotypes based on the observed diploid genotypes and the a priori expectation of haplotype patterns informed by Population Genetics (Stephens et al., AJHG 2001)
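The maximum likelihood approach is commonly implemented with an EM iteration. The Python sketch below is a small illustrative version for bi-allelic sites under random mating; it is not the published program. Genotypes are assumed to be tuples giving the number of copies (0, 1 or 2) of allele "1" at each site, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All ordered haplotype pairs (h1, h2) consistent with an unphased genotype."""
    options = {0: [("0", "0")], 1: [("0", "1"), ("1", "0")], 2: [("1", "1")]}
    pairs = []
    for combo in product(*(options[g] for g in genotype)):
        h1 = "".join(a for a, _ in combo)
        h2 = "".join(b for _, b in combo)
        pairs.append((h1, h2))
    return pairs


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    pair_lists = [compatible_pairs(g) for g in genotypes]
    haps = sorted({h for pairs in pair_lists for pair in pairs for h in pair})
    freq = {h: 1 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each individual over its compatible phasings.
        for pairs in pair_lists:
            weights = [freq[h1] * freq[h2] for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: new frequencies are expected counts per chromosome.
        n_chrom = 2 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    return freq


# Four 00/00 homozygotes, four 11/11 homozygotes, two ambiguous double heterozygotes.
genotypes = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
freq = em_haplotype_frequencies(genotypes)
print({h: round(f, 4) for h, f in freq.items()})
```

In this toy data the homozygotes make "00" and "11" common, so the double heterozygotes are resolved almost entirely as 00/11 rather than 01/10. Enumerating phasings grows as 2 to the number of heterozygous sites, so the sketch is only suitable for a few markers.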

Page 45:

Haplotype inference

http://pga.gs.washington.edu/

Page 46:

Haplotype annotations – LD based

• Pair-wise LD-plots

Wall & Pritchard Nature Rev Gen 2003

• LD-based multi-marker block definitions requiring strong pair-wise LD between all pairs in block

Page 47:

Annotations – haplotype blocks

• Dynamic programming approach (Zhang et al. AJHG 2001)

1. meet block definition based on common haplotype requirements

2. within each block, determine the number of SNPs that distinguish common haplotypes (htSNPs)

3. minimize the total number of htSNPs over complete region including all blocks

Page 48:

Haplotype tagging SNPs (htSNPs)

Find groups of SNPs such that each possible pair is in strong LD (above threshold).

Carlson, AJHG 2005
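LD-based tag SNP selection can be sketched as a greedy binning procedure. The Python code below is a simplified variant written for illustration: it repeatedly picks the SNP in strong LD with the most unbinned SNPs and groups them into one bin. It only requires each bin member to exceed the r2 threshold with the bin's tag SNP, which is weaker than the all-pairs condition stated on the slide. The function names, the 0.8 threshold and the 0/1 haplotype encoding are assumptions.

```python
def r_squared(haplotypes, i, j):
    """r2 between sites i and j of phased 0/1 haplotype strings."""
    n = len(haplotypes)
    f_i = sum(h[i] == "1" for h in haplotypes) / n
    f_j = sum(h[j] == "1" for h in haplotypes) / n
    f_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = f_i * (1 - f_i) * f_j * (1 - f_j)
    if denom == 0:
        return 0.0   # monomorphic site: treat as carrying no association
    return (f_ij - f_i * f_j) ** 2 / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Return a list of (tag_site, member_sites) bins covering every site."""
    n_sites = len(haplotypes[0])
    partners = {
        i: {j for j in range(n_sites)
            if j != i and r_squared(haplotypes, i, j) >= threshold}
        for i in range(n_sites)
    }
    unbinned = set(range(n_sites))
    bins = []
    while unbinned:
        # Pick the site that tags the most still-unbinned sites (lowest index on ties).
        tag = max(sorted(unbinned), key=lambda i: len(partners[i] & unbinned))
        members = {tag} | (partners[tag] & unbinned)
        bins.append((tag, sorted(members)))
        unbinned -= members
    return bins


# Hypothetical data: sites 0, 1 and 2 are perfectly associated; site 3 is independent.
haps = ["0010", "0011", "1100", "1101"]
print(greedy_tag_snps(haps))
```

Genotyping one tag per bin then stands in for all SNPs in that bin, which is the marker reduction the slide refers to.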

Page 49:

Focal questions about the HapMap

1. Required marker density

2. How to quantify the strength of allelic association in a genome region

3. How to choose tagging SNPs

4. How general the answers to these questions are among different human populations

[Figure labels: CEPH European samples, Yoruban samples]

Page 50:

Samples from a single population?

(random 60-chromosome subsets of 120 CEPH chromosomes from 60 independent individuals)

Page 51:

Consequence for marker performance

Markers selected based on the allele structure of the HapMap reference samples…

… may not work well in another set of samples such as those used for a clinical study.

Page 52:

Sample-to-sample variability?

1. Understanding intrinsic properties of a given genome region, e.g. estimating local recombination rate from the HapMap data (McVean et al. Science 2004)

2. Experimentally genotype additional sets of samples, and compare association structure across consecutive sets directly

3. It would be a desirable alternative to generate such additional sets with computational means

Page 53:

Towards a marker selection tool

1. select markers (tag SNPs) with standard methods

2. generate computational samples for this genome region

3. test the performance of markers across consecutive sets of computational samples

Page 54:

Generating data-relevant haplotypes

1. Generate a pair of haplotype sets with Coalescent genealogies. This “models” that the two sets are “related” to each other by being drawn from a single population.

2. Only accept the pair if the first set reproduces the observed haplotype structure of the HapMap reference samples. This enforces relevance to the observed genotype data in the specific region.

3. Use the second haplotype set induced by the same mutations as our computational samples.

Page 55:

Generating computational samples

Problem: The efficiency of generating data-relevant genealogies (and therefore additional sample sets) with standard Coalescent tools is very low even for modest sample size (N) and number of markers (M). Despite serious efforts with various approaches (e.g. importance sampling) efficient generation of such genealogies is an unsolved problem.

[Figure labels: N, M]

We are developing a method to generate “approximative” M-marker haplotypes by composing consecutive, overlapping sets of data-relevant K-site haplotypes (for small K).

Motivation from composite likelihood approaches to recombination rate estimation by Hudson, Clark, Wall, and others.

Page 56:

M-site haplotypes as composites of overlapping K-site haplotypes

1. generate K-site sets

2. build M-site composites


Page 57:

Piecing together K-site sets

[Figure: two histograms of counts (0 to 20) of the eight 3-site haplotypes "000" through "111" at neighboring, overlapping windows, with haplotypes matched at the shared markers]

this should work to the degree to which the constraint at overlapping markers preserves long-range marker association

Page 58:

Building composite haplotypes

[Figure: a series of histograms of counts (0 to 20) of the eight 3-site haplotypes "000" through "111", one histogram per consecutive K-site window]

A composite haplotype is built from a complete path through the (M-K+1) K-sites.
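The idea of a path through overlapping K-site windows can be sketched as a chained sampling procedure. The Python code below is a deliberately simplified illustration and not the method under development described here: it takes K-site haplotype counts from phased data rather than from Coalescent simulations, and it samples one composite at a time, choosing at each step among K-site haplotypes that agree with the previous window at the K - 1 shared markers. All names are hypothetical.

```python
import random
from collections import Counter


def k_site_windows(haplotypes, k):
    """Count K-site sub-haplotypes in each of the M - K + 1 consecutive windows."""
    m = len(haplotypes[0])
    return [Counter(h[s:s + k] for h in haplotypes) for s in range(m - k + 1)]


def sample_composite(windows, rng):
    """Sample one M-site haplotype as a path through consecutive K-site windows."""
    first = windows[0]
    current = rng.choices(list(first), weights=list(first.values()))[0]
    composite = current
    for window in windows[1:]:
        overlap = current[1:]   # the K - 1 markers shared with the next window
        candidates = {h: c for h, c in window.items() if h[:-1] == overlap and c > 0}
        if not candidates:
            raise ValueError("no K-site haplotype is consistent with the overlap")
        current = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        composite += current[-1]   # each step adds one new marker
    return composite


# Hypothetical 5-marker data with two haplotypes in complete association.
data = ["00000"] * 6 + ["11111"] * 4
windows = k_site_windows(data, k=3)
rng = random.Random(3)
composites = [sample_composite(windows, rng) for _ in range(10)]
print(composites)
```

Because each step conditions only on the previous window, association beyond K markers is preserved only as far as the overlap constraint carries it.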

Page 59:

3-site composite haplotypes

a typical 3-site composite

30 CEPH HapMap reference individuals (60 chr)

Hinds et al. Science, 2005

Page 60:

3-site composite vs. data

[Scatter plot: r2 (3-site composite) against r2 (data), both axes 0 to 1]

Page 61:

3-site composites: the “best case”

[Scatter plot: r2 ("exact" 3-site composite) against r2 (data), both axes 0 to 1; marker pairs labeled “short-range” and “long-range”]

1. generate K-site sets

Page 62:

Variability across sets

The purpose of the composite haplotype sets is to model sample variance across consecutive data sets.

But the variability across the composite haplotype sets is compounded by the inherent loss of long-range association when 3-sites are used.

Page 63:

4-site composite haplotypes

[Scatter plot: r2 (4-site composite #2) against r2 (data), both axes 0 to 1]

Page 64:

“Best-case” 4-site composites

Composite of exact 4-site sub-haplotypes

[Scatter plot: r2 ("exact" 4-site composite) against r2 (data), both axes 0 to 1]

Page 65:

Variability across 4-site composites

Page 66:

Variability across 4-site composites

[Scatter plots: r2 (data #2) against r2 (data #1); r2 (4-site composite #5) against r2 (4-site composite #1); all axes 0 to 1]

… is comparable to the variability across data sets.

Page 67:

Utility for association studies?

• No matter how good the resource is, its success in finding disease-causing variants depends greatly on the allelic structure of common diseases, a question under debate

• Regardless of how we describe human association structure, many questions remain about the relative merits of single-marker vs. haplotype-based strategies for medical association studies

Page 68:

http://clavius.bc.edu/~marthlab/MarthLab