Date posted: 22-Dec-2015
Sequence Variation Informatics
Gabor T. Marth
Department of Biology, Boston College
BI420 – Introduction to Bioinformatics
Sequence variations
• The Human Genome Project produced a reference genome sequence; about 99.9% of it is shared by all humans
• sequence variations make our genetic makeup unique
SNP
• Single-nucleotide polymorphisms (SNPs) are most abundant, but other types of variations exist and are important
Why do we care about variations?
phenotypic differences
inherited diseases
demographic history
Where do variations come from?
• sequence variations are the result of mutation events
[Figure: genealogy from the most recent common ancestor (MRCA) down to present-day sequences, which carry either TAAAAAT or TAACAAT]
• mutations are propagated down through generations
• variation patterns permit reconstruction of phylogeny
SNP discovery
• comparative analysis of multiple sequences from the same region of the genome (redundant sequence coverage)
• diverse sequence resources can be used: EST, WGS, BAC
Steps of SNP discovery
Sequence clustering
Cluster refinement
Multiple alignment
SNP detection
Computational SNP mining – PolyBayes
Two innovative ideas:
1. Utilize the genome reference sequence as a template to organize other sequence fragments from arbitrary sources
2. Use sequence quality information (base quality values) to distinguish true mismatches from sequencing errors
[Figure labels: sequencing error, true polymorphism]
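The role of base quality values can be illustrated with a minimal sketch. It assumes the quality values are Phred-scaled (Q = -10 log10 of the error probability); the slides do not name the scale, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality value to a base-call error probability.

    Assumption: Q = -10 * log10(P_error), the usual Phred convention.
    """
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error: float) -> float:
    """Inverse conversion: error probability back to a Phred-scaled value."""
    return -10 * math.log10(p_error)


# A mismatch seen at Q40 is far less likely to be a sequencing error
# than the same mismatch seen at Q10.
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```

Under this convention a mismatch between two high-quality bases is much more likely to be a true difference than a mismatch that involves a low-quality base.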
Computational SNP mining – PolyBayes
sequence clustering simplifies to database search with genome reference
paralog filtering by counting mismatches weighted by quality values
multiple alignment by anchoring fragments to genome reference
SNP detection by differentiating true polymorphism from sequencing error using quality values
[Figure: sequence fragments organized along the genome reference sequence, in four steps]
1. Fragment recruitment (database search)
2. Anchored alignment
3. Paralog identification
4. SNP detection
SNP discovery with PolyBayes
Sequence clustering
• Clustering simplifies to search against sequence database to recruit relevant sequences
[Figure: fragments grouped into cluster 1, cluster 2, and cluster 3 along the genome reference]
• Clusters = groups of overlapping sequence fragments matching the genome reference
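A minimal sketch of the clustering idea, assuming the database search has already produced one hit per fragment with start and end coordinates on the genome reference. The `Hit` layout, the coordinates, and the function name are hypothetical; this is not PolyBayes code.

```python
from typing import List, Tuple

# (fragment id, start, end) on the genome reference; end is exclusive.
Hit = Tuple[str, int, int]


def cluster_hits(hits: List[Hit]) -> List[List[str]]:
    """Group fragments whose reference intervals overlap into clusters."""
    clusters: List[List[str]] = []
    current_end = None
    for frag_id, start, end in sorted(hits, key=lambda h: (h[1], h[2])):
        if current_end is None or start >= current_end:
            # No overlap with the running cluster: open a new one.
            clusters.append([frag_id])
            current_end = end
        else:
            # Overlaps the running cluster: extend it.
            clusters[-1].append(frag_id)
            current_end = max(current_end, end)
    return clusters


hits = [
    ("est1", 100, 400),
    ("est2", 350, 700),
    ("wgs1", 1200, 1700),
    ("est3", 1650, 1900),
    ("bac1", 5000, 5600),
]
clusters = cluster_hits(hits)
print(clusters)
```

Because every fragment is placed on the same reference coordinate system, clustering reduces to sorting and merging intervals rather than comparing all fragments against each other.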
(Anchored) multiple alignment
• The genomic reference sequence serves as an anchor
  • fragments pair-wise aligned to genomic sequence
  • insertions are propagated – “sequence padding”
• Advantages
  • efficient -- only involves pair-wise comparisons
  • accurate -- correctly aligns alternatively spliced ESTs
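Sequence padding can be sketched as follows. The input layout is an assumption made for illustration: each fragment carries its start on the reference, one aligned character per reference base, and a dictionary of insertions keyed by the reference index they follow. The `*` pad character and all names are hypothetical.

```python
from typing import Dict, List


def pad_alignments(reference: str, frags: List[Dict]) -> List[str]:
    """Merge pair-wise alignments into one padded multiple alignment.

    Each fragment dict has:
      "start":   reference index of its first aligned base
      "aligned": one character per covered reference base
      "ins":     {reference index: bases inserted after that index}
    """
    n = len(reference)
    # Widest insertion seen after each reference position.
    pad = [0] * n
    for f in frags:
        for pos, seq in f["ins"].items():
            pad[pos] = max(pad[pos], len(seq))

    def build(start: int, aligned: str, ins: Dict[int, str]) -> str:
        row = []
        for i in range(n):
            j = i - start
            inside = 0 <= j < len(aligned)
            row.append(aligned[j] if inside else " ")
            # Propagate the insertion column(s) to every row.
            filler = "*" if inside else " "
            row.append(ins.get(i, "").ljust(pad[i], filler))
        return "".join(row)

    rows = [build(0, reference, {})]
    rows += [build(f["start"], f["aligned"], f["ins"]) for f in frags]
    return rows


reference = "ACGTACGT"
frags = [
    {"start": 1, "aligned": "CGTAC", "ins": {3: "GG"}},
    {"start": 2, "aligned": "GTACGT", "ins": {}},
]
rows = pad_alignments(reference, frags)
print("\n".join(rows))
```

Only pair-wise alignments against the reference are needed as input; the multiple alignment falls out of padding every row wherever any fragment has an insertion.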
Paralog filtering -- idea
• The “paralog problem”
• unrecognized paralogs give rise to spurious SNP predictions
• SNPs in duplicated regions may be useless for genotyping
Paralogous difference
Sequencing errors
• Challenge
• to differentiate between sequencing errors and paralogous difference
Paralog filtering -- probabilities
• Model of expected discrepancies
  • Native: sequencing error + polymorphisms
  • Paralog: sequencing error + paralogous sequence difference
• Pair-wise comparison between EST and genomic sequence
[Figure: Paralog discrimination. Probability (0 to 1) versus number of discrepancies d (0 to 15); curves: P(d|Model_NAT), P(d|Model_PAR), P(Model_NAT|d)]
• Bayesian discrimination algorithm
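The Bayesian discrimination can be sketched under simplifying assumptions: discrepancies are modeled as binomial with a flat per-base rate, whereas the slides describe mismatches weighted by quality values. The rates, the prior, and the function names below are illustrative placeholders, not values from the slides.

```python
from math import comb


def binom_pmf(d: int, n: int, p: float) -> float:
    """Probability of d discrepancies in n compared bases at per-base rate p."""
    return comb(n, d) * p**d * (1 - p) ** (n - d)


def p_native(d: int, n: int,
             err: float = 0.01,        # assumed sequencing error rate
             poly: float = 0.001,      # assumed polymorphism rate
             par_div: float = 0.02,    # assumed paralogous divergence
             prior_native: float = 0.5) -> float:
    """Posterior P(Model_NAT | d) by Bayes' rule over the two models."""
    like_nat = binom_pmf(d, n, err + poly)     # P(d | Model_NAT)
    like_par = binom_pmf(d, n, err + par_div)  # P(d | Model_PAR)
    num = like_nat * prior_native
    return num / (num + like_par * (1 - prior_native))


# Posterior for a 500-base comparison as the discrepancy count grows.
for d in range(0, 16, 3):
    print(d, round(p_native(d, 500), 4))
```

With these placeholder rates the posterior falls steadily as the number of discrepancies rises, which is the qualitative shape of the P(Model_NAT|d) curve.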
Paralog filtering -- paralogs
Paralog filtering -- selectivity
[Figure: Distribution of P(NAT) probability values. Number of sequences (0 to 1,200) versus P(NAT) (0.05 to 0.95); annotations: 375 paralogous ESTs, 1,579 native ESTs, probability cutoff]
SNP detection
• Goal: to discern true variation from sequencing error
sequencing error polymorphism
Bayesian-statistical SNP detection
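\[
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{\dfrac{P(S_1|R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N|R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1,\ldots,S_N)}
{\displaystyle\sum_{S_{i_1}\in[A,C,G,T]} \cdots \sum_{S_{i_N}\in[A,C,G,T]} \dfrac{P(S_{i_1}|R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N}|R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1},\ldots,S_{i_N})}
\]
where S_i is the nucleotide assigned to sequence i, R_i is the read data for that sequence, and the outer sum runs over all variable (polymorphic) permutations.
[Figure: columns of aligned bases (A, C, T, G) illustrating a monomorphic permutation and a polymorphic permutation. Inputs to the Bayesian posterior probability: base call + base quality, expected polymorphism rate, base composition, depth of coverage]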
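The posterior above can be sketched by brute-force enumeration for a single alignment column. Several pieces are simplifying assumptions and are not taken from the slides: Phred-scaled qualities, a uniform base composition prior of 0.25, a miscalled base equally likely to be any of the other three, and a joint prior that splits 1 - poly_rate evenly over the four monomorphic permutations and poly_rate evenly over all variable ones. The function name is hypothetical.

```python
from itertools import product

BASES = "ACGT"


def p_snp(calls: str, quals: list, poly_rate: float = 1 / 300) -> float:
    """Posterior probability that an alignment column is polymorphic.

    calls: the called base in each of the N aligned sequences
    quals: the matching base quality values (assumed Phred-scaled)
    """
    n = len(calls)
    n_variable = 4**n - 4  # permutations containing more than one base
    total = 0.0
    variable = 0.0
    for perm in product(BASES, repeat=n):
        mono = len(set(perm)) == 1
        # Simplified joint prior P_Prior(S_1, ..., S_N); an assumption.
        term = (1 - poly_rate) / 4 if mono else poly_rate / n_variable
        for s, call, q in zip(perm, calls, quals):
            e = 10 ** (-q / 10)
            p_s_given_r = 1 - e if s == call else e / 3
            term *= p_s_given_r / 0.25  # divide by P_Prior(S_i)
        total += term
        if not mono:
            variable += term
    return variable / total


print(p_snp("AAAA", [40, 40, 40, 40]))
print(p_snp("AACC", [40, 40, 40, 40]))
print(p_snp("AAAC", [40, 40, 40, 5]))
```

Enumeration costs 4^N terms, so this form is only practical for shallow columns; it is meant to make the structure of the formula concrete, not to be efficient.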
The SNP score
polymorphism
specific variation
SNP priors
• Polymorphism rate in population -- e.g. 1 / 300 bp
• Sample size (alignment depth)
[Figure: Prob(k alleles of N = 20). Probability (0 to 0.8) versus k alleles (0 to 20), for p = 0.02, p = 0.1, p = 0.5]
• Distribution of SNPs according to minor allele frequency
[Figure: relative occurrence [%] (0 to 40) versus minor allele frequency [%] (10 to 50)]
• Distribution of SNPs according to specific variation
[Figure: relative occurrence (0 to 70) versus variation type (AC, AG, AT, CG)]
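The sample-size prior, Prob(k alleles of N = 20), can be sketched by assuming simple binomial sampling: each of the N aligned sequences independently carries the allele with population frequency p. The binomial model and the function names are assumptions made for illustration.

```python
from math import comb


def prob_k_alleles(k: int, n: int, p: float) -> float:
    """Probability that k of n sampled sequences carry an allele of frequency p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_allele_seen(n: int, p: float) -> float:
    """Probability that the allele appears at least once among n sequences."""
    return 1 - (1 - p) ** n


# The three allele frequencies shown on the slide, at depth N = 20.
for p in (0.02, 0.1, 0.5):
    print(p, round(prob_k_alleles(0, 20, p), 4), round(prob_allele_seen(20, p), 4))
```

At depth 20 an allele with p = 0.02 is absent from the sample about two times out of three under this model, so shallow alignments mostly miss rare alleles.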
Selectivity of detection
[Figure: Distribution of P(SNP) values. Number of sites (0 to 120) versus P(SNP) (0.1 to 1); annotations: 76,844, SNP probability threshold]
Validation by pooled sequencing
[Figure: SNP confirmation rate. Confirmation rate (0 to 80) versus P(SNP) in bins 0.37 - 0.59, 0.60 - 0.79, 0.80 - 1.00; legend: SNPs confirmed; African, Asian, Caucasian, Hispanic, CHM 1]
Validation by re-sequencing
[Figure: confirmation rate [%] (0 to 100) versus SNP score [%] in bins 51-60, 61-70, 71-80, 81-90, 91-100]
Rare alleles are hard to detect
[Figure: Detection of a single allele. Quality value (0 to 50) versus alignment depth (2 to 10), for Threshold = 0.5 and Threshold = 0.9]
[Figure: Quality value vs. allele frequency (alignment depth = 20). Quality value (0 to 50) versus allele frequency [%] (5 to 50), for Threshold = 0.5 and Threshold = 0.9]
• frequent alleles are easier to detect
• high-quality alleles are easier to detect
The PolyBayes software
• First statistically rigorous SNP discovery tool
• Correctly analyzes alternative cDNA splice forms
• Available for use (~70 licenses)
http://genome.wustl.edu/gsc/polybayes
Marth et al., Nature Genetics, 1999
INDEL discovery
There is no “base quality” value for “deleted” nucleotide(s)
Sequencing chemistry context-dependent
No reliable prior expectation for INDEL rates of various classes
INDEL discovery
[Figure: an insertion with its insertion flanks, aligned against a deletion with its deletion flanks]
• Q(insertion flank) >= 35
• Q(deletion flank) >= 35
• Q(deletion) = average of Q(deletion flank)
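The flank rules can be sketched as a small filter. The slides do not say how many bases make up a flank or whether the threshold applies per base or to an average; this sketch assumes each flank is a list of base quality values and that every flank base must reach the threshold. Function names are hypothetical.

```python
from statistics import mean

MIN_FLANK_Q = 35  # threshold stated on the slide


def deletion_quality(del_flank_quals: list) -> float:
    """Surrogate quality for a deletion: the average of its flank qualities."""
    return mean(del_flank_quals)


def indel_candidate_ok(ins_flank_quals: list, del_flank_quals: list,
                       min_q: int = MIN_FLANK_Q) -> bool:
    """Accept a candidate only if all flank bases meet the threshold.

    Assumption: the threshold is applied to every flank base.
    """
    return min(ins_flank_quals) >= min_q and min(del_flank_quals) >= min_q


print(deletion_quality([40, 36]))
print(indel_candidate_ok([40, 38], [40, 36]))
print(indel_candidate_ok([40, 30], [40, 36]))
```

Borrowing quality from the flanks is a workaround for the fact that a deleted base has no quality value of its own.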
INDEL discovery
[Figure: fraction observed [%] (0 to 80) versus insertion length [bp] (1 to 9)]
• 123,035 candidate INDELs (~ 25% of substitutions)
• Majority 1-4 bp insertion length (1 bp – 68%, 2 bp – 13%)
• Validation rate steeply increases with insertion length
14.3% < 60.8% < 61.7%
SNP discovery in diploid traces
sequence is guaranteed to originate from a single location: no alignment problem
usually, PCR products are sequenced from multiple individuals
sequence is the product of two chromosomes, hence can be heterozygous; base quality values are not applicable to heterozygous sequence
SNP discovery in diploid traces
Homozygous trace peak
Heterozygous trace peak
SNP mining: genome BAC overlaps
overlap detection
• inter- & intra-chromosomal duplications
• known human repeats
• fragmentary nature of draft data
SNP analysis
candidate SNP predictions
>CloneX
ACGTTGCAACGTGTCAATGCTGCA
>CloneY
ACGTTGCAACGTGTCAATGCTGCA
ACCTAGGAGACTGAACTTACTG
ACCTAGGAGACCGAACTTACTG
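A minimal sketch of the SNP analysis step, assuming the overlap has already been detected and aligned without gaps. It scans two overlapping clone sequences for mismatches, using the two 22-base sequences shown on the slide; the function name and the plain position-by-position comparison are illustrative only, and a real analysis would also weigh base quality values.

```python
from typing import List, Tuple


def candidate_snps(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Return (0-based position, base in A, base in B) for each mismatch.

    Assumption: the two sequences are already aligned, gap-free,
    and of equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


overlap_x = "ACCTAGGAGACTGAACTTACTG"
overlap_y = "ACCTAGGAGACCGAACTTACTG"
snps = candidate_snps(overlap_x, overlap_y)
print(snps)
```

The single T/C difference is the kind of discrepancy reported as a candidate SNP prediction.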
BAC overlap mining results
~ 30,000 clones
25,901 clones (7,122 finished, 18,779 draft with base quality values)
21,020 clone overlaps (124,356 fragment overlaps)
507,152 high-quality candidate SNPs (validation rate 83-96%)
Marth et al., Nature Genetics 2001
SNP mining projects
1. Short deletions/insertions (DIPs) in the BAC overlaps (Weber et al., AJHG 2002)
2. The SNP Consortium (TSC): polymorphism discovery in random, shotgun reads from whole-genome libraries (Sachidanandam et al., Nature 2001)
The current variation resource
• The current public resource (dbSNP) contains over 2 million SNPs as a dense genome map of polymorphic markers
1. How are these SNPs structured within the genome?
2. What can we learn about the processes that shape human variability?