
Page 1: Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College marth@bc.edu BI420 – Introduction to Bioinformatics.

Sequence Variation Informatics

Gabor T. Marth

Department of Biology, Boston College
marth@bc.edu

BI420 – Introduction to Bioinformatics

Page 2

Sequence variations

• The Human Genome Project produced a reference genome sequence that is about 99.9% identical across all humans

• sequence variations make our genetic makeup unique

• Single-nucleotide polymorphisms (SNPs) are the most abundant, but other types of variations exist and are important

Page 3

Why do we care about variations?

phenotypic differences

inherited diseases

demographic history

Page 4

Where do variations come from?

• sequence variations are the result of mutation events

• mutations are propagated down through generations

• variation patterns permit reconstruction of phylogeny

[Figure: genealogy rooted at the MRCA, in which a mutation changes TAAAAAT to TAACAAT and each descendant carries one of the two alleles]

Page 5

SNP discovery

• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)

• diverse sequence resources can be used: EST, WGS, BAC

Page 6

Steps of SNP discovery

Sequence clustering

Cluster refinement

Multiple alignment

SNP detection

Page 7

Computational SNP mining – PolyBayes

Two innovative ideas:

1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources

2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors

[Figure: aligned reads contrasting a sequencing error with a true polymorphism]

Page 8

Computational SNP mining – PolyBayes

sequence clustering simplifies to database search with genome reference

paralog filtering by counting mismatches weighted by quality values

multiple alignment by anchoring fragments to genome reference

SNP detection by differentiating true polymorphism from sequencing error using quality values
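As a small illustration of why base quality values help here, the sketch below converts between a quality value and an error probability. It assumes the quality values are Phred-scaled, Q = -10 log10(P_error); the function names are made up for this example and are not part of PolyBayes.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong (assumes Phred scaling)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality value corresponding to error probability p."""
    return -10 * math.log10(p)

# A mismatch seen at a low-quality base is far more likely to be a
# sequencing error than one seen at a high-quality base.
for q in (10, 20, 40):
    print(q, phred_to_error_prob(q))
```

Under this assumption a mismatch at Q10 has a 10% chance of being a miscall, while a mismatch at Q40 has a 0.01% chance, which is the information the quality-weighted steps rely on.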

Page 9

SNP discovery with PolyBayes

1. Fragment recruitment (database search)

2. Anchored alignment

3. Paralog identification

4. SNP detection

[Figure: sequence fragments organized along the genome reference sequence]

Page 10

Sequence clustering

• Clustering simplifies to a search against a sequence database to recruit relevant sequences

• Clusters = groups of overlapping sequence fragments matching the genome reference

[Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference]
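One way to picture this step: once each recruited fragment has a matching interval on the genome reference, clusters are the groups of fragments whose intervals overlap. The sketch below assumes the database search has already produced hits as (name, start, end) tuples in half-open reference coordinates; that tuple layout and the function name are illustrative assumptions, not the PolyBayes data model.

```python
def cluster_fragments(hits):
    """Group fragments whose reference intervals overlap.

    hits: iterable of (name, start, end), half-open [start, end) on the
    reference (a hypothetical format chosen for this sketch).
    Returns a list of clusters, each a list of fragment names.
    """
    clusters = []
    cluster_end = None
    for name, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if clusters and start < cluster_end:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(name)
            cluster_end = max(cluster_end, end)
        else:
            # No overlap: start a new cluster.
            clusters.append([name])
            cluster_end = end
    return clusters

hits = [("est1", 0, 100), ("est2", 50, 150), ("est3", 300, 400), ("est4", 120, 200)]
print(cluster_fragments(hits))
```

Sorting by start coordinate lets a single pass decide cluster membership, so no fragment-against-fragment comparison is needed.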

Page 11

(Anchored) multiple alignment

• The genomic reference sequence serves as an anchor
  • fragments are pair-wise aligned to the genomic sequence
  • insertions are propagated – “sequence padding”

• Advantages
  • efficient -- only involves pair-wise comparisons
  • accurate -- correctly aligns alternatively spliced ESTs
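The padding idea can be sketched as follows: each fragment is aligned only to the reference, and wherever any fragment carries an insertion, pad characters are added to every other row so the columns stay in register. The input format (reference start, gapped reference string, gapped fragment string), the pad character `*`, and the `.` marker for uncovered positions are assumptions made for this sketch; it also assumes every pairwise alignment begins with a reference base.

```python
def anchored_msa(reference, pairwise):
    """Merge pairwise fragment-to-reference alignments into one padded alignment.

    pairwise: list of (ref_start, aligned_ref, aligned_frag), where the two
    strings have equal length and use '-' for gaps (hypothetical format).
    Returns rows of equal length; row 0 is the padded reference.
    """
    parsed = []
    max_ins = [0] * len(reference)  # longest insertion after each reference position
    for ref_start, aln_ref, aln_frag in pairwise:
        bases, inserts = {}, {}
        pos = ref_start - 1  # last reference position consumed so far
        for r, f in zip(aln_ref, aln_frag):
            if r != "-":
                pos += 1
                bases[pos] = f  # fragment base, or '-' for a deletion
            else:
                inserts[pos] = inserts.get(pos, "") + f
        for p, ins in inserts.items():
            max_ins[p] = max(max_ins[p], len(ins))
        parsed.append((bases, inserts))

    def render(bases, inserts):
        row = []
        for p in range(len(reference)):
            if p in bases:
                # Pad short or absent insertions with '*' ("sequence padding").
                row.append(bases[p] + inserts.get(p, "").ljust(max_ins[p], "*"))
            else:
                row.append("." * (1 + max_ins[p]))  # position not covered
        return "".join(row)

    rows = [render(dict(enumerate(reference)), {})]
    rows += [render(b, i) for b, i in parsed]
    return rows

rows = anchored_msa("ACGTACGT", [(0, "ACG-TAC", "ACGGTAC"), (2, "GTACGT", "GTA-GT")])
print("\n".join(rows))
```

Because every fragment is compared only with the reference, the cost grows linearly with the number of fragments, which is the efficiency argument on the slide.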

Page 12

Paralog filtering -- idea

• The “paralog problem”
  • unrecognized paralogs give rise to spurious SNP predictions
  • SNPs in duplicated regions may be useless for genotyping

• Challenge
  • to differentiate between sequencing errors and paralogous difference

[Figure: alignment contrasting a paralogous difference with sequencing errors]

Page 13

Paralog filtering -- probabilities

• Pair-wise comparison between EST and genomic sequence

• Model of expected discrepancies
  • Native: sequencing error + polymorphisms
  • Paralog: sequencing error + paralogous sequence difference

• Bayesian discrimination algorithm

[Figure: “Paralog discrimination”. Probability (0 to 1) versus discrepancies d (0 to 15); curves: P(d|Model_NAT), P(d|Model_PAR), P(Model_NAT|d)]
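The discrimination can be written with Bayes' rule: P(Model_NAT|d) = P(d|Model_NAT) P(NAT) / [P(d|Model_NAT) P(NAT) + P(d|Model_PAR) P(PAR)]. The sketch below is a minimal version under assumptions that are not taken from the slides: discrepancy counts are modeled as Poisson, and the error, polymorphism, and paralogous-difference rates and the prior are placeholder values.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of observing k events with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def p_native_given_d(d, length, err_rate=0.01, poly_rate=0.001,
                     paralog_rate=0.02, prior_native=0.5):
    """Posterior probability that an aligned EST is native, given d discrepancies.

    All rates and the prior are illustrative assumptions, as is the Poisson model.
    """
    lam_nat = length * (err_rate + poly_rate)     # native: error + polymorphism
    lam_par = length * (err_rate + paralog_rate)  # paralog: error + paralogous difference
    like_nat = poisson_pmf(d, lam_nat) * prior_native
    like_par = poisson_pmf(d, lam_par) * (1 - prior_native)
    return like_nat / (like_nat + like_par)

for d in (0, 3, 6, 9, 15):
    print(d, round(p_native_given_d(d, length=300), 4))
```

With these placeholder rates the posterior falls from near 1 at d = 0 to near 0 at d = 15, the same qualitative shape as the P(Model_NAT|d) curve in the figure.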

Page 14

Paralog filtering -- paralogs

Page 15

Paralog filtering -- selectivity

[Figure: “Distribution of P(NAT) probability values”. Number of sequences (0 to 1,200) versus P(NAT) in bins from 0.05 to 0.95, with the probability cutoff marked; 375 paralogous ESTs and 1,579 native ESTs]

Page 16

SNP detection

• Goal: to discern true variation from sequencing error

sequencing error polymorphism

Page 17

Bayesian-statistical SNP detection

[Figure: alignment columns of the bases A, C, T, G illustrating a polymorphic permutation and a monomorphic permutation]

Bayesian posterior probability:

$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)}
{\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]} \dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})}
$$

Inputs: base call + base quality, expected polymorphism rate, base composition, depth of coverage
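A brute-force sketch of this posterior for one alignment column is given below. It enumerates every assignment of true bases S_1..S_N, weights each by the per-read terms P(S_i|R_i) / P_Prior(S_i) and by a joint prior, and sums the polymorphic (variable) assignments. Several pieces are simplifying assumptions rather than the published model: Phred-scaled qualities, an error split evenly over the three other bases, uniform base composition, and a joint prior that spreads the polymorphism rate evenly over all non-monomorphic permutations.

```python
from itertools import product

BASES = "ACGT"

def p_base_given_read(base, call, q):
    """P(true base | base call, quality), assuming Phred-scaled q."""
    e = 10 ** (-q / 10)
    return 1 - e if base == call else e / 3

def snp_probability(calls, quals, poly_rate=1 / 300, base_prior=0.25):
    """Posterior probability that the column is polymorphic (simplified priors)."""
    n = len(calls)
    n_poly = 4 ** n - 4  # number of non-monomorphic permutations
    numerator = 0.0
    denominator = 0.0
    for perm in product(BASES, repeat=n):
        monomorphic = len(set(perm)) == 1
        # Assumed joint prior: uniform within each class of permutation.
        joint_prior = (1 - poly_rate) / 4 if monomorphic else poly_rate / n_poly
        term = joint_prior
        for s, call, q in zip(perm, calls, quals):
            term *= p_base_given_read(s, call, q) / base_prior
        denominator += term
        if not monomorphic:
            numerator += term
    return numerator / denominator

print(snp_probability("AAAA", [40, 40, 40, 40]))
print(snp_probability("AACC", [40, 40, 40, 40]))
print(snp_probability("AAAC", [40, 40, 40, 10]))
```

The enumeration visits 4^N permutations, so this form is only practical for shallow columns; it is meant to show how base quality, polymorphism rate, base composition, and depth enter the calculation.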

Page 18

The SNP score

polymorphism

specific variation

Page 19

SNP priors

• Polymorphism rate in population -- e.g. 1 / 300 bp

• Sample size (alignment depth)

[Figure: “Prob(k alleles of N = 20)”. Probability (0 to 0.8) versus k alleles (0 to 20) for p = 0.02, p = 0.1, p = 0.5]

• Distribution of SNPs according to minor allele frequency

[Figure: relative occurrence [%] (0 to 40) versus minor allele frequency [%] (10 to 50)]

• Distribution of SNPs according to specific variation

[Figure: relative occurrence (0 to 70) versus variation type (AC, AG, AT, CG)]
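The sample-size prior can be illustrated with a binomial calculation: if an allele has population frequency p, how likely is it to appear k times among N sampled sequences? Reading the "Prob(k alleles of N = 20)" plot as a binomial sampling probability is an assumption of this sketch, as is the function name.

```python
from math import comb

def prob_k_alleles(k, n, p):
    """Binomial probability of seeing an allele of frequency p exactly k times in n sequences."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n = 20
for p in (0.02, 0.1, 0.5):
    missed = prob_k_alleles(0, n, p)  # allele absent from the whole sample
    print(f"p = {p}: P(k = 0) = {missed:.3f}, P(k >= 1) = {1 - missed:.3f}")
```

Under this reading, an allele at 2% frequency is absent from a depth-20 sample about two times out of three, so shallow alignments mostly reveal common alleles.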

Page 20

Selectivity of detection

[Figure: “Distribution of P(SNP) values”. Number of sites (0 to 120) versus P(SNP) (0.1 to 1), with the SNP probability threshold marked; annotation: 76,844]

Page 21

Validation by pooled sequencing

[Figure: “SNP confirmation rate”. Confirmation rate (0 to 80) versus P(SNP) in bins 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00; legend: SNPs confirmed]

African, Asian, Caucasian, Hispanic, CHM 1

Page 22

Validation by re-sequencing

[Figure: confirmation rate [%] (0 to 100) versus SNP score [%] in bins 51-60, 61-70, 71-80, 81-90, 91-100]

Page 23

Rare alleles are hard to detect

[Figure: “Detection of a single allele”. Quality value (0 to 50) versus alignment depth (2 to 10) for Threshold = 0.5 and Threshold = 0.9]

[Figure: “Quality value vs. allele frequency (alignment depth = 20)”. Quality value (0 to 50) versus allele frequency [%] (5 to 50) for Threshold = 0.5 and Threshold = 0.9]

• frequent alleles are easier to detect

• high-quality alleles are easier to detect

Page 24

The PolyBayes software

http://genome.wustl.edu/gsc/polybayes

Marth et al., Nature Genetics, 1999

• Available for use (~70 licenses)

• First statistically rigorous SNP discovery tool

• Correctly analyzes alternative cDNA splice forms

Page 25

INDEL discovery

There is no “base quality” value for “deleted” nucleotide(s)

Sequencing chemistry is context-dependent

No reliable prior expectation for INDEL rates of various classes

Page 26

INDEL discovery

[Figure labels: Insertion, Insertion Flank, Deletion, Deletion Flank]

Q(insertion flank) >= 35

Q(deletion flank) >= 35

Q(deletion) = average of Q(deletion flank)
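The flank rule can be expressed in a few lines. The sketch below assumes the flank qualities are available as lists of base quality values for the bases adjacent to the candidate INDEL; how many flanking bases are used is not stated on the slide, and the function names are illustrative.

```python
def deletion_quality(deletion_flank_quals):
    """Surrogate quality for a deletion: the average quality of its flanking bases."""
    return sum(deletion_flank_quals) / len(deletion_flank_quals)

def passes_indel_filter(insertion_flank_quals, deletion_flank_quals, min_q=35):
    """Keep a candidate only if every flanking base on both sides meets the threshold."""
    return (all(q >= min_q for q in insertion_flank_quals)
            and all(q >= min_q for q in deletion_flank_quals))

print(deletion_quality([36, 40]))
print(passes_indel_filter([40, 38], [36, 35]))
print(passes_indel_filter([40, 30], [36, 35]))
```

Averaging the flanks gives the deleted position a usable quality value even though no base was called there.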

Page 27

INDEL discovery

• 123,035 candidate INDELs (~ 25% of substitutions)

• Majority 1-4 bp insertion length (1 bp – 68%, 2 bp – 13%)

• Validation rate steeply increases with insertion length: 14.3% < 60.8% < 61.7%

[Figure: fraction observed [%] (0 to 80) versus insertion length [bp] (1 to 9)]

Page 28

SNP discovery in diploid traces

sequence is guaranteed to originate from a single location: no alignment problem

usually, PCR products are sequenced from multiple individuals

sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence


Page 29

SNP discovery in diploid traces

Homozygous trace peak

Heterozygous trace peak

Page 30

SNP mining: genome BAC overlaps

overlap detection

• inter- & intra-chromosomal duplications
• known human repeats
• fragmentary nature of draft data

SNP analysis

candidate SNP predictions

Page 31

BAC overlap mining results

>CloneX
ACGTTGCAACGTGTCAATGCTGCA
>CloneY
ACGTTGCAACGTGTCAATGCTGCA

ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG

~ 30,000 clones

25,901 clones (7,122 finished, 18,779 draft with base quality values)

21,020 clone overlaps (124,356 fragment overlaps)

507,152 high-quality candidate SNPs (validation rate 83-96%)

Marth et al., Nature Genetics 2001
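The pair of 22-base overlap sequences shown on this slide differs at a single position, which is what a candidate SNP in a clone overlap looks like. The sketch below scans two already-aligned overlap sequences for mismatches where both bases are of high quality. The quality values, the threshold of 20, and the function name are assumptions for illustration; the slide reports only that draft clones carried base quality values.

```python
def candidate_snps(seq_x, seq_y, qual_x, qual_y, min_q=20):
    """Return (0-based position, base_x, base_y) for high-quality mismatches.

    Assumes seq_x and seq_y are already aligned without gaps.
    """
    hits = []
    for pos, (bx, by, qx, qy) in enumerate(zip(seq_x, seq_y, qual_x, qual_y)):
        if bx != by and qx >= min_q and qy >= min_q:
            hits.append((pos, bx, by))
    return hits

overlap_x = "ACCTAGGAGACTGAACTTACTG"
overlap_y = "ACCTAGGAGACCGAACTTACTG"
quals = [40] * len(overlap_x)  # made-up quality values
print(candidate_snps(overlap_x, overlap_y, quals, quals))
```

Requiring high quality on both sides is what keeps sequencing errors in draft clones from being reported as polymorphisms.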

Page 32

SNP mining projects

1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)

2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)

Page 33

The current variation resource

• The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers

1. How are these SNPs structured within the genome?

2. What can we learn about the processes that shape human variability?