Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College [email protected]...

Sequence Variation Informatics

Gabor T. Marth

Department of Biology, Boston College

[email protected]

BI420 – Introduction to Bioinformatics

Sequence variations

• The Human Genome Project produced a reference genome sequence; about 99.9% of it is common to every human being

• sequence variations make our genetic makeup unique

SNP

• Single-nucleotide polymorphisms (SNPs) are most abundant, but other types of variations exist and are important

Why do we care about variations?

phenotypic differences

inherited diseases

demographic history

Where do variations come from?

• sequence variations are the result of mutation events

[Figure: genealogy descending from the most recent common ancestor (MRCA); a mutation changes TAAAAAT to TAACAAT, and both alleles appear among the descendants]

• mutations are propagated down through generations

• variation patterns permit reconstruction of phylogeny

SNP discovery

• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)

• diverse sequence resources can be used: EST, WGS, BAC

Steps of SNP discovery

Sequence clustering

Cluster refinement

Multiple alignment

SNP detection

Computational SNP mining – PolyBayes

Two innovative ideas:

1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources

2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors

[Figure: sequencing error vs. true polymorphism]
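A base quality value is useful here because it can be read as an error probability. The sketch below assumes the conventional Phred scaling, Q = -10 log10(P_error); the function names are illustrative and not part of PolyBayes.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred-scaled base quality value to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability back to a Phred-scaled quality value."""
    return -10 * math.log10(p)

# A mismatch at a Q40 base is far less likely to be a sequencing error
# than a mismatch at a Q10 base.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this scaling, a mismatch supported by high-quality bases is more likely to be a true polymorphism than one supported by low-quality bases.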

Computational SNP mining – PolyBayes

sequence clustering simplifies to database search with genome reference

paralog filtering by counting mismatches weighted by quality values

multiple alignment by anchoring fragments to genome reference

SNP detection by differentiating true polymorphism from sequencing error using quality values

SNP discovery with PolyBayes

[Figure: pipeline anchored on the genome reference sequence]

1. Fragment recruitment (database search)

2. Anchored alignment

3. Paralog identification

4. SNP detection

Sequence clustering

• Clustering simplifies to search against sequence database to recruit relevant sequences

[Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference]

• Clusters = groups of overlapping sequence fragments matching the genome reference
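Once the database search has placed each fragment on the genome reference, forming clusters amounts to merging overlapping intervals. A minimal sketch, assuming each hit is given as a hypothetical (fragment_id, ref_start, ref_end) tuple with half-open reference coordinates:

```python
def cluster_fragments(hits):
    """Group fragment hits into clusters of overlapping reference intervals.

    hits: iterable of (fragment_id, ref_start, ref_end), half-open coordinates.
    Returns a list of clusters, each a list of fragment ids.
    """
    clusters = []
    current, current_end = [], None
    for frag_id, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current and start < current_end:
            # Fragment overlaps the running cluster: extend it.
            current.append(frag_id)
            current_end = max(current_end, end)
        else:
            # No overlap: close the running cluster and start a new one.
            if current:
                clusters.append(current)
            current, current_end = [frag_id], end
    if current:
        clusters.append(current)
    return clusters

hits = [("est1", 0, 100), ("est2", 50, 150),
        ("est3", 300, 400), ("est4", 350, 420),
        ("est5", 1000, 1100)]
print(cluster_fragments(hits))
```

The fragment names and coordinates are made up for illustration.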

(Anchored) multiple alignment

• The genomic reference sequence serves as an anchor

• fragments pair-wise aligned to genomic sequence

• insertions are propagated – “sequence padding”

• Advantages

• efficient -- only involves pair-wise comparisons

• accurate -- correctly aligns alternatively spliced ESTs
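The padding idea can be sketched as follows: each fragment is aligned only to the reference, and wherever any fragment carries an insertion, a gap of the same width is propagated into the reference row and every other row. This is an illustrative sketch, not the PolyBayes implementation; the input format (ref_start, gapped reference string, gapped fragment string) is an assumption.

```python
def anchored_alignment(reference, pairwise):
    """Merge pair-wise alignments to a reference into one padded alignment.

    pairwise: list of (ref_start, ref_aln, frag_aln); '-' marks a gap.
    Returns rows of equal length: the padded reference, then each fragment.
    """
    n = len(reference)
    pad = [0] * (n + 1)  # pad[i]: widest insertion seen just before reference base i
    parsed = []
    for ref_start, ref_aln, frag_aln in pairwise:
        inserted = {}  # reference position -> bases inserted before it
        aligned = {}   # reference position -> fragment character aligned to it
        pos, run = ref_start, ""
        for r, f in zip(ref_aln, frag_aln):
            if r == "-":
                run += f  # fragment base with no reference counterpart
            else:
                if run:
                    inserted[pos] = run
                    pad[pos] = max(pad[pos], len(run))
                    run = ""
                aligned[pos] = f
                pos += 1
        if run:  # insertion at the very end of the alignment
            inserted[pos] = run
            pad[pos] = max(pad[pos], len(run))
        parsed.append((inserted, aligned))

    def render(inserted, aligned, uncovered):
        row = []
        for i in range(n):
            row.append(inserted.get(i, "").ljust(pad[i], "-"))
            row.append(aligned.get(i, uncovered))
        row.append(inserted.get(n, "").ljust(pad[n], "-"))
        return "".join(row)

    ref_row = render({}, dict(enumerate(reference)), "-")
    return [ref_row] + [render(ins, aln, ".") for ins, aln in parsed]

rows = anchored_alignment("ACGT", [(0, "AC-GT", "ACTGT"), (1, "CGT", "CAT")])
print("\n".join(rows))
```

Here "." marks reference positions a fragment does not cover. Only pair-wise alignments are ever computed, which is the efficiency advantage listed above.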

Paralog filtering -- idea

• The “paralog problem”

• unrecognized paralogs give rise to spurious SNP predictions

• SNPs in duplicated regions may be useless for genotyping

[Figure: paralogous difference vs. sequencing errors]

• Challenge

• to differentiate between sequencing errors and paralogous difference

Paralog filtering -- probabilities

• Model of expected discrepancies

• Native: sequencing error + polymorphisms

• Paralog: sequencing error + paralogous sequence difference

• Pair-wise comparison between EST and genomic sequence

Paralog discrimination

• Bayesian discrimination algorithm

[Figure: probability (0 to 1) vs. number of discrepancies d (0 to 15); curves: P(d|Model_NAT), P(d|Model_PAR), P(Model_NAT|d)]
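The discrimination step can be illustrated with Bayes' rule applied to the number of discrepancies d between a fragment and the genomic sequence. The sketch below makes several assumptions that are not stated on the slides: discrepancy counts are modeled as Poisson, quality values are Phred-scaled, and the polymorphism rate, paralogous difference rate, and prior are placeholder values.

```python
import math

def poisson_pmf(d, lam):
    """Probability of exactly d events under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** d / math.factorial(d)

def prob_native(d, quals, poly_rate=0.001, paralog_rate=0.02, prior_native=0.5):
    """Posterior P(Model_NAT | d) for a fragment with d observed discrepancies.

    quals: base quality values of the aligned fragment bases (assumed Phred).
    poly_rate, paralog_rate, prior_native: illustrative values only.
    """
    n = len(quals)
    expected_errors = sum(10 ** (-q / 10) for q in quals)
    lam_nat = expected_errors + poly_rate * n     # sequencing error + polymorphisms
    lam_par = expected_errors + paralog_rate * n  # sequencing error + paralogous difference
    like_nat = poisson_pmf(d, lam_nat)            # P(d | Model_NAT)
    like_par = poisson_pmf(d, lam_par)            # P(d | Model_PAR)
    numerator = like_nat * prior_native
    return numerator / (numerator + like_par * (1 - prior_native))

quals = [30] * 500  # a hypothetical 500-base fragment, all bases Q30
for d in (0, 2, 4, 6, 8, 12):
    print(d, round(prob_native(d, quals), 4))
```

With these placeholder values, a fragment with few discrepancies is assigned to the native model and one with many discrepancies to the paralog model, which is the behavior the P(Model_NAT|d) curve describes.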

Paralog filtering -- paralogs

Paralog filtering -- selectivity

[Figure: Distribution of P(NAT) probability values; x-axis: P(NAT), 0.05 to 0.95; y-axis: number of sequences, 0 to 1200; annotations: 375 paralogous ESTs, 1,579 native ESTs, probability cutoff]

SNP detection

• Goal: to discern true variation from sequencing error

[Figure: sequencing error vs. polymorphism]

Bayesian-statistical SNP detection

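$$
P(\mathrm{SNP}) =
\frac{\displaystyle \sum_{\text{all variable } S}
      \frac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots
      \frac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \,
      P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
     {\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
      \frac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots
      \frac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \,
      P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

[Figure: columns of A, C, G, T enumerated as polymorphic permutations and monomorphic permutations]

• P(SNP): Bayesian posterior probability

• P(S_i | R_i): base call + base quality

• P_Prior(S_i): base composition

• P_Prior(S_1, ..., S_N): expected polymorphism rate

• N: depth of coverage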
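For a shallow alignment column, this posterior can be evaluated by brute force: enumerate every assignment of true bases to the N reads, weight each by how well it agrees with the base calls and qualities, and take the share of the total weight that falls on polymorphic assignments. The sketch below is illustrative only and simplifies the priors: base composition is taken as uniform, quality values are assumed Phred-scaled, the prior mass 1 - poly_rate is split evenly over the four monomorphic assignments, and poly_rate is split evenly over all polymorphic ones.

```python
from itertools import product

BASES = "ACGT"

def snp_probability(calls, quals, poly_rate=1 / 300):
    """Posterior probability that an alignment column is polymorphic.

    calls: called bases in the column, one per read (e.g. "AAGG").
    quals: matching base quality values (assumed Phred-scaled).
    Enumerates 4**N assignments, so it is only practical for small N.
    """
    n = len(calls)
    base_prior = 0.25      # uniform base composition (assumption)
    n_poly = 4 ** n - 4    # number of polymorphic assignments
    numerator = 0.0
    denominator = 0.0
    for perm in product(BASES, repeat=n):
        term = 1.0
        for s, call, q in zip(perm, calls, quals):
            err = 10 ** (-q / 10)
            # P(S_i | R_i): the called base gets 1 - err, the others share err.
            p_s_given_r = 1 - err if s == call else err / 3
            term *= p_s_given_r / base_prior
        polymorphic = len(set(perm)) > 1
        if polymorphic:
            term *= poly_rate / n_poly       # simplified joint prior
            numerator += term
        else:
            term *= (1 - poly_rate) / 4
        denominator += term
    return numerator / denominator

print(snp_probability("AAAA", [40, 40, 40, 40]))  # all reads agree
print(snp_probability("AAGG", [40, 40, 40, 40]))  # two well-supported alleles
print(snp_probability("AAAG", [40, 40, 40, 5]))   # lone low-quality mismatch
```

Under these assumptions, a mismatch carried by a low-quality base scores much lower than the same mismatch carried by a high-quality base.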

The SNP score

polymorphism

specific variation

SNP priors

• Polymorphism rate in population -- e.g. 1 / 300 bp

• Sample size (alignment depth)

[Figure: Prob(k alleles of N = 20) vs. k alleles (0 to 20), for p = 0.02, p = 0.1, p = 0.5]

• Distribution of SNPs according to minor allele frequency

[Figure: relative occurrence [%] (0 to 40) vs. minor allele frequency [%] (10 to 50)]

• Distribution of SNPs according to specific variation

[Figure: relative occurrence (0 to 70) vs. variation type (AC, AG, AT, CG)]
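The effect of sample size on the prior can be illustrated with a simple sampling model. The sketch below assumes that the number of copies k of an allele among N sampled sequences is binomial with population allele frequency p; the slides do not state which model was used for the "Prob(k alleles of N = 20)" plot, so this is an assumption.

```python
from math import comb

def prob_k_alleles(k, n, p):
    """Binomial probability of seeing k copies of an allele among n sequences."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 20
for p in (0.02, 0.1, 0.5):
    missed = prob_k_alleles(0, n, p)  # allele absent from the sample
    print(f"p = {p}: P(k = 0) = {missed:.4f}")
```

Under this model, an allele at frequency 0.02 is absent from roughly two thirds of samples of 20 sequences, which is one reason rare alleles are harder to discover at low alignment depth.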

Selectivity of detection

[Figure: Distribution of P(SNP) values; x-axis: P(SNP), 0.1 to 1; y-axis: number of sites, 0 to 120; annotations: 76,844, SNP probability threshold]

Validation by pooled sequencing

[Figure: SNP confirmation rate; x-axis: P(SNP) in bins 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00; y-axis: confirmation rate, 0 to 80; label: SNPs confirmed]

[Figure labels: African, Asian, Caucasian, Hispanic, CHM 1]

Validation by re-sequencing

[Figure: confirmation rate [%] (0 to 100) vs. SNP score [%] in bins 51-60, 61-70, 71-80, 81-90, 91-100]

Rare alleles are hard to detect

[Figure: Detection of a single allele; quality value (0 to 50) vs. alignment depth (2 to 10); curves: Threshold = 0.5, Threshold = 0.9]

[Figure: Quality value vs. allele frequency (alignment depth = 20); quality value (0 to 50) vs. allele frequency [%] (5 to 50); curves: Threshold = 0.5, Threshold = 0.9]

• frequent alleles are easier to detect

• high-quality alleles are easier to detect

The PolyBayes software

http://genome.wustl.edu/gsc/polybayes

• First statistically rigorous SNP discovery tool

• Correctly analyzes alternative cDNA splice forms

• Available for use (~70 licenses)

Marth et al., Nature Genetics, 1999

INDEL discovery

There is no “base quality” value for “deleted” nucleotide(s)

Sequencing chemistry context-dependent

No reliable prior expectation for INDEL rates of various classes

INDEL discovery

[Figure: an insertion with its insertion flanks, and a deletion with its deletion flanks]

Q(insertion flank) >= 35

Q(deletion flank) >= 35

Q(deletion) = average of Q(deletion flank)
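These rules can be written down directly. The sketch below is one reading of the slide, with hypothetical function names: a deleted position has no base quality of its own, so it borrows the average quality of its two flanking bases, and a candidate is kept only if the flanking qualities reach the threshold of 35.

```python
def deletion_quality(left_flank_q, right_flank_q):
    """Quality assigned to a deletion: the average of its flanking base qualities."""
    return (left_flank_q + right_flank_q) / 2

def passes_flank_filter(flank_quals, min_q=35):
    """True if every flanking base quality meets the threshold."""
    return all(q >= min_q for q in flank_quals)

flanks = (38, 42)  # hypothetical flanking base qualities
print(deletion_quality(*flanks), passes_flank_filter(flanks))
```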

INDEL discovery

[Figure: fraction observed [%] (0 to 80) vs. insertion length [bp] (1 to 9)]

• 123,035 candidate INDELs (~ 25% of substitutions)

• Majority 1-4 bp insertion length (1 bp – 68%, 2 bp – 13%)

• Validation rate steeply increases with insertion length: 14.3% < 60.8% < 61.7%

SNP discovery in diploid traces

sequence is guaranteed to originate from a single location: no alignment problem

usually, PCR products are sequenced from multiple individuals

sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence


SNP discovery in diploid traces

Homozygous trace peak

Heterozygous trace peak

SNP mining: genome BAC overlaps

overlap detection

inter- & intra-chromosomal duplications; known human repeats; fragmentary nature of draft data

SNP analysis

candidate SNP predictions

[Figure: two overlapping clones and an aligned overlap with a single mismatch]

>CloneX
ACGTTGCAACGTGTCAATGCTGCA

>CloneY
ACGTTGCAACGTGTCAATGCTGCA

ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG

BAC overlap mining results

~ 30,000 clones

25,901 clones (7,122 finished, 18,779 draft with base quality values)

21,020 clone overlaps (124,356 fragment overlaps)

507,152 high-quality candidate SNPs (validation rate 83-96%)

Marth et al., Nature Genetics 2001
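At its core, mining a clone overlap means scanning the aligned overlap for mismatches and keeping those supported by high-quality bases. A minimal sketch using the two overlap sequences from the slide; the function name and the optional quality filter are illustrative assumptions, not the published pipeline.

```python
def candidate_snps(seq_x, seq_y, qual_x=None, qual_y=None, min_q=0):
    """Return (position, base_x, base_y) for mismatches in an aligned overlap.

    Positions are 0-based. If quality values are supplied, a mismatch is kept
    only when the base in that sequence has quality >= min_q.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    hits = []
    for i, (x, y) in enumerate(zip(seq_x, seq_y)):
        if x == y:
            continue
        if qual_x is not None and qual_x[i] < min_q:
            continue
        if qual_y is not None and qual_y[i] < min_q:
            continue
        hits.append((i, x, y))
    return hits

clone_x = "ACCTAGGAGACTGAACTTACTG"
clone_y = "ACCTAGGAGACCGAACTTACTG"
print(candidate_snps(clone_x, clone_y))  # [(11, 'T', 'C')]
```

Finished clones carry no per-base quality values in this sketch, which is why the quality arguments are optional.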

SNP mining projects

1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)

2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

The current variation resource

• The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers

1. How are these SNPs structured within the genome?

2. What can we learn about the processes that shape human variability?
